A data warehouse is where you store structured historical data.
Most organizations, once they bought into the value of Big Data, found that not all of their information fit into a relational structure. That need led to the data lake, where you pour in all your unstructured object data, hopefully to fish useful bits out of it later.
And now, as generative AI apps hunger for even more data, a new architecture, the data lakehouse, has emerged to keep both structured and unstructured data in the same location, offering ACID-level transactions on object storage. It is a natural fit for streaming data, AI modeling and training, and other new workloads.
With a lakehouse, a retail company, for example, could combine weather forecasts with user buying data to better stock shelves with the seasonally appropriate items customers want.
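The "ACID-level transactions on object storage" piece typically comes not from the storage system itself but from an atomic commit to an append-only transaction log of immutable data files. Here is a minimal, hypothetical sketch of that idea in Python; the file names and layout are invented for illustration and do not match Delta Lake's or Iceberg's actual on-disk specifications.

```python
import json
import os
import tempfile

# Toy illustration of how lakehouse table formats get ACID semantics on
# object storage: data files are immutable, and each commit atomically
# publishes a new numbered entry in an append-only transaction log.
# Hypothetical layout; not the real Delta Lake or Iceberg protocol.

def commit(table_dir: str, added_files: list[str]) -> int:
    """Atomically publish a new table version listing added data files."""
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    versions = [int(f.split(".")[0])
                for f in os.listdir(log_dir) if f.endswith(".json")]
    next_version = max(versions, default=-1) + 1
    entry = {"version": next_version, "add": added_files}
    # Write the log entry to a temp file, then rename it into place.
    # The rename is the atomic "commit point": a reader either sees the
    # finished entry or nothing at all, never a half-written one.
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f)
    os.rename(tmp, os.path.join(log_dir, f"{next_version:020d}.json"))
    return next_version

def snapshot(table_dir: str) -> list[str]:
    """Replay the log to find the current set of live data files."""
    log_dir = os.path.join(table_dir, "_log")
    files: list[str] = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            files.extend(json.load(f)["add"])
    return files
```

Because the data files never change and only a tiny log entry is committed, readers always see a consistent table version even on eventually consistent object stores.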
Databricks Buys Tabular
Now Databricks, the originator of the lakehouse concept, wants to unify the field by building an “open data lakehouse.” Last month it purchased the data management company Tabular.
With this reportedly US $1 billion purchase, the company plans to unify the two most popular formats for lakehouses, Apache Iceberg and Databricks’ own Delta Lake.
Tabular was founded by Jason Reid and by Ryan Blue and Daniel Weeks, who created Apache Iceberg while they were working at Netflix.
“A big part of [the acquisition] is having the creators of Iceberg in the company,” said Adam Conway, senior vice president of products at Databricks, in a phone interview with TNS.
The company will work to steer the two projects closer together, he said.
There are some differences between the two formats. Delta is especially well suited for streaming workloads, whereas Iceberg is built around a strong data catalog, which provides many data management capabilities.
“Our goal is just to make it so that it doesn’t really matter which one you choose,” Conway said.
The Emergence of the Lakehouse
About a decade ago, Databricks recognized an emergent behavior of “people using their data lakes as data warehouses,” Conway said.
The beauty of the lakehouse was that it was an architecture that allowed the user to pick the best analytics engine for the job — as long as the data was in an open format.
This approach would disrupt the traditional data warehouse vendors — Google‘s BigQuery, Amazon Web Services‘ Redshift, Teradata, and Snowflake — which built business models around storing data on their own proprietary systems.
“It’s part of their business, that lock-in,” Conway said.
In response, Databricks developed its own open source lakehouse format, Delta Lake, which it subsequently donated to the Linux Foundation.
The Delta Lake format was open in that it could be used by open source analytics engines such as (primarily) Apache Spark, but also others like Trino and Presto. More than 10,000 companies globally use Delta Lake (it has a larger user base than Iceberg, Conway argued), and it is used to process over 4EB of data on average each day.
Although both Iceberg and Delta Lake use the underlying Apache Parquet columnar data format, and both offer largely the same functionality, each format developed independently, so they have been largely incompatible. “They had different features but they were done in different ways,” Conway said.
Customers had started using both formats, though in many cases, the deployments were happening in different parts of the organization, which defeated the point of a unifying lakehouse altogether.
“Unfortunately, the lakehouse paradigm has been split between the two most popular formats,” admitted Ali Ghodsi, co-founder and CEO at Databricks, in a statement.
Iceberg and Delta Lake, Together Forever
Tabular also has some user-focused features that Databricks users will no doubt enjoy. One in particular is its data catalog.
But the overall goal is clear: Databricks wants to get both formats as compatible with each other as possible.
The Tabular folks can also help in the development of Delta Lake UniForm, an open source Databricks capability that lets a single table be read in both the Iceberg and Delta Lake formats. Databricks moved UniForm to general availability during its user conference last month in San Francisco.
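The idea behind a bridge like UniForm is that the Parquet data files are written once, while metadata is maintained in parallel for both table formats, so either kind of reader sees the same table. A toy Python sketch of that dual-metadata idea; the directory and file names here are hypothetical and do not reflect the real on-disk specs of either format.

```python
import json
import os

# Toy sketch of dual-format metadata: one copy of the data, described
# twice, once per table format's metadata convention. The layout below
# is invented for illustration, not the actual UniForm implementation.

def write_with_dual_metadata(table_dir: str, data_file: str) -> None:
    os.makedirs(os.path.join(table_dir, "_delta_log"), exist_ok=True)
    os.makedirs(os.path.join(table_dir, "metadata"), exist_ok=True)
    # One copy of the (stand-in) data file...
    open(os.path.join(table_dir, data_file), "w").close()
    # ...registered in a Delta-style log entry...
    with open(os.path.join(table_dir, "_delta_log", "0.json"), "w") as f:
        json.dump({"add": {"path": data_file}}, f)
    # ...and in an Iceberg-style manifest, so readers of either
    # format resolve to the same underlying data file.
    with open(os.path.join(table_dir, "metadata", "manifest.json"), "w") as f:
        json.dump({"data_files": [data_file]}, f)
```

The payoff is that the expensive part, the data, is never duplicated; only the comparatively tiny metadata is kept in two dialects.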
Set the Data Free
The Tabular acquisition announcement fell in the same week as Snowflake‘s own annual user conference. There, Snowflake announced that it would support Iceberg Tables as a format, and that users could either store that data in Snowflake or on their own servers. It also launched its own open source data catalog, Polaris, which could index data from non-Snowflake sources and be used by any analytics engine, not just Snowflake’s.
Like most other data warehouses, Snowflake keeps user data mostly on its own servers. The company’s move to bring-your-own-storage is a validation of the lakehouse format, Conway argued.
This bring-your-own-storage seems to be tracking with current industry best practices.
At Aerospike‘s virtual Real-Time Data Summit last month, Google Vice President of Engineering Sameet Agarwal discussed the importance of disconnecting storage and compute.
Storage should be the global foundation of any business, he said. A uniform data format should span even across data centers and combine hot, warm and cold workloads in the same storage system.
And the cost and scalability of data storage should not be yoked to that of compute power. “It’s very important that the cost of managing the system does not scale linearly with the amount of storage; as the amount of storage grows, the cost of management cannot grow linearly with it,” he advised.
All of this leads to why cloud storage is the best option for data lakehouses, he said.
“We think of its evolution from a data lakehouse to an AI lakehouse,” Agarwal said. “We want the Data AI Lakehouse to be a single source of truth, not just for structured and semi-structured data, but also unstructured data.”
The post Showdown at the Lakehouse: Databricks Muscles Up With Tabular appeared first on The New Stack.
By acquiring Tabular, Databricks can combine Apache Iceberg expertise with its own Delta Lake format, and promises to unify the increasingly fragmented market for data lakehouses.