“Many data architectures can benefit from a table format, and in my view, #ApacheIceberg is the one to choose – it’s (actually) open, has a vibrant and growing ecosystem, and is designed for interoperability,” he wrote in a January LinkedIn post.

He didn’t have to mention Delta Lake by name. Another database table format, originally created by Snowflake competitor Databricks, Delta Lake has attracted less interest and engagement from the open-source developer community than Iceberg has. There had already been plenty of chatter among database wranglers questioning its open-source cred.

Databricks software engineers knew a dig at their baby when they saw one, and it got their dander up. They quickly came to Delta’s defense. A sarcastic shouting match in text ensued about the distinctions between a truly open-source project and one that’s proprietary.

An old enterprise tech debate had come to the cloud database wars.

John Lynch, field CTO at Databricks, poked Malone, pointing out in the same LinkedIn thread that Snowflake’s own software is itself proprietary. He posted a link to Delta Lake’s source code on GitHub, the go-to home for open-source software collaboration. A smiley face emoji punctuated the burn.

“It’s not open source. It’s open code,” replied Malone about Delta Lake.

“We don’t need to get into semantics James,” shot back Spencer Cook, financial services solutions architect at Databricks.

But this public display was about more than just developers and engineers picking sides in a tired debate that has been common over the last 15 years of enterprise tech and the hundreds of open-source projects that drove that growth.

“Nerd wars are always fun. But there are some very objective differences in the approach that the Apache Iceberg project has taken versus the Databricks Delta Lake approach,” said Billy Bosworth, CEO of Dremio, whose company has highlighted its use of Iceberg in its own products.

Open and shut

Malone and other database engineers say there is confusion among their customers around what parts of Delta Lake are open source. They say Databricks puts up roadblocks to Delta’s full capabilities, forcing users to choose between paying for access to its full performance and breadth of features — or getting stuck with limited capabilities when implementing Delta’s open-source code.

They complain that even though Delta Lake lives on GitHub as an open-source project, Databricks employees wield undue control over decisions to make adjustments to its code without public review. They say that Iceberg — another database table format born inside Netflix and now managed by the open-source Apache Software Foundation — has fostered a more diverse community of contributors from a much wider array of companies than Delta.

The criticism of Delta Lake’s open-source status is “not totally a fair assessment,” said Denny Lee, head of developer relations at Databricks, who said the project has over 200 contributors from 70 different organizations. “Thousands of our customers — non-Databricks employees — are active in the community because Delta Lake is critical for the reliability of their data pipelines and we continue to add features based on their feedback,” he said.

However, open-source purists argue that a truly free and open-source project would not seek engagement from “customers,” but rather a wider community of collaborators. Ultimately, some say this quasi-open-source approach — however much it rubs some database builders the wrong way — is all part of the Databricks playbook.

“It gets a little confusing sometimes when you’re trying to distinguish between the Databricks version of Delta Lake, and then what they’ve open-sourced in the open-source version of Delta Lake,” Bosworth said.

The confusion trickles up from people building databases to enable data queries and analytics to business decision-makers, said Malone. “We’ve heard that confusion from customers,” he said regarding Delta Lake, which Snowflake does support along with Iceberg. “A customer will want to make sure their workload will run reliably. It becomes a critical component. It has serious implications for how you’re running a business,” he said.

“At best, when features are missing, users likely have to rework their code when they switch between proprietary and open-code versions,” Malone said. At worst, he said, customers are “locked into a paid version and that fact is not made clear.” He added, “There has not been anything done to address that confusion.”

Ali Ghodsi, co-founder and CEO of Databricks, responded to the criticism in a statement sent to Protocol: “Our platform documentation explains which performance features are only available on Databricks, but all of the features for reading, writing, and managing data are open and usable in this wide ecosystem of other products.” He added that Databricks is planning “a big announcement around open-source Delta Lake” at the company’s conference later this month.

Foundational questions

Although Iceberg and Delta Lake both attempt to fulfill the same data table formatting needs, there are distinctions that can affect a company’s bottom line, Bosworth said. “It’s an architectural decision of the type where you live with it for about a decade or more when you make it. So, it’s a very critical point in the architecture: to pause and ask, ‘Am I building my foundation on something that I’m going to be comfortable with for the next decade in my organization?’” he said.

Amid squabbles over Delta Lake, momentum is growing behind Iceberg. Along with adoption by Dremio and Snowflake, AWS used Iceberg to build its Athena query service, which was made widely available in April.

Google Cloud also christened Iceberg by choosing to support it first over Delta in its new lakehouse product, BigLake. “We are supporting Iceberg first with BigLake because that’s the demand that we see on GCP,” Gerrit Kazmaier, vice president for Database, Data Analytics and Looker at Google, told Protocol. However, he added that GCP has limited support for Delta “because Databricks is available on GCP, and there are some Databricks ‘interop’ scenarios with BigQuery.”

Support in places like AWS, GCP and Snowflake could inspire developers to add Iceberg to their toolset, while possibly dismissing Delta, said Bosworth, who worked as a developer in the first decade of his career. “You don’t want to miss the cool kids’ party. People underestimate the psychological impact of the developer decisions.”

Coolness is one thing, but getting a job matters, too. “A lot of developers like to be on the front edge of those waves as they emerge. A lot of developers know they won’t go wrong with open-source projects on their resume,” he added.

Still, some companies have not warmed up to Iceberg.

Microsoft and its customers have cozied up to Delta Lake instead, said James Serra, a data and AI solutions architect at Microsoft who helps its customers build solutions in its Azure cloud platform. “When it comes to Iceberg, I honestly haven’t seen any customers at all using it. Over time, especially in the last year, everybody is going, in our world, to Delta Lake.”

Because of that customer interest, he said, Microsoft updated its products to incorporate the open-source version of Delta while adding its own improved data storage and performance features.

‘Delta Lake is not a Databricks project’

Sometimes when Delta users run into problems, rather than the collaborative tinkering common in many open-source communities, issues are addressed by Databricks employees and treated almost like IT or software customer service ticket requests. When bugsbunny1101 posted issue #1129 in the Delta Lake GitHub project in May noting “inconsistent behavior between opensource delta and databricks runtime,” another user added, “I’m experiencing the exact same issue.”

Two Databricks software engineers chimed in saying they were investigating the issue. “We at Delta Lake haven’t forgotten about this issue,” wrote Scott Sandre, a Databricks software engineer, in late May. “We are working away on the next Delta Lake release, and are hoping to get it out by the Data and AI summit next month,” he continued, alluding to his company’s upcoming conference.

Serra said Delta Lake might not satisfy the criteria of a genuinely open-source project, in part because “it is not widely contributed to.” But that might not matter, he said. “You could say it’s still a really good solution because Databricks is contributing to it and they’ve made it work really well.”

While many contributors to Delta Lake are from Databricks, people from other companies including Esri, IBM and Microsoft have collaborated in its community on GitHub.

“It’s first important to note that while Databricks has built on top of Delta Lake within our Lakehouse Platform to advance query performance, Delta Lake is not a Databricks project,” Ghodsi said, noting that Delta Lake is managed by the Linux Foundation and people from AWS, Comcast, Google and Tableau contribute code to it.

Revisiting Spark’s quasi-open-source playbook

Databricks has an inherent conflict of interest in Delta Lake, said Ryan Blue, co-founder and CEO of data platform startup Tabular and a former Netflix database engineer who helped build Iceberg. Because Databricks sells access to its compute engine while also offering a data storage product like Delta, he said, the company is likely to steer people toward its compute services to enable better performance.

“Everyone sees the vision of this multi-engine future,” Blue said, explaining why Tabular is built on Iceberg. “We’re saying we’re going to be neutral to the compute engine because that’s what’s in our customer’s interest.”

But delivering performance enhancements through the paid version is indeed the Databricks strategy. “The difference is in the performance,” Lee told Protocol. “Databricks has done things to make the query performance much faster, but that has nothing to do with the format.” He acknowledged the confused perception of Delta Lake is understandable because “Delta Lake was originally proprietary [in] 2017 before it was made open source in 2019.”

Indeed, with Delta Lake, the co-founders of Databricks seem to be running in reverse the same pseudo-open-source play they used to monetize the open-source user base that had built up around Apache Spark, the popular open-source project they started in 2009. That time, they packaged improved features for Spark into a better-performing paid product, forming the foundation of Databricks, which launched in 2013.

“We quickly realized only open source would fuel really big growth,” Ghodsi said in a 2021 conversation with Forbes regarding Spark. “The challenge, though, was getting anyone to pay for our product.” The profit-driven compromise was what Ghodsi himself called “SaaS open source,” wherein Databricks charges customers to update and operate the product while contributing “constantly to the open-source version of Databricks that’s entirely free.”

“You can say they’re trying to do the same thing with Delta Lake,” Serra said.

“This seems to me like slightly disingenuous behavior,” said Armon Petrossian, CEO of data transformation and analytics company Coalesce, who said some companies seem to establish open-source projects in order to generate a community around them, then pull a bait-and-switch by converting those projects to paid products or steering users toward a better, paid version.

“We’ve seen the concept of open source evolve over the years where what was some altruistic intention of being able to support users [has become] a go-to-market motion,” Petrossian said.

“I never see [Databricks] as ever being dishonest or manipulative,” Bosworth said. “I don’t think it’s in any sense a nefarious sort of thing. It’s just their business model. And that’s okay.”

If anything, the confusion and contention around Delta Lake illustrates there are many interpretations of what “open” means in relation to software technology.

“Open comes in a lot of flavors. There’s open source; there’s open formats; and there’s open standards,” Bosworth said. “You can conceptually have a very open system that’s based on open standards and open protocols, and open formats, files and things like that — but no open-source software.”

“Trying to define open source is hard,” Malone said. “This is not necessarily a new problem.”
