The Lake Is Not the Database. The Engine Is.

4 min read

Apr 26, 2026

There is a story being told right now, in conference halls and blog posts and architecture diagrams, that once data reaches a certain scale, the lakehouse becomes the default architecture. Data, once scattered across systems, now settles into object storage in open formats, accessible to whatever compute layer needs it.

From there, the next step feels almost inevitable. If everything lives in the lake, why not run everything directly on it? Why not let the lake become the database?

It is a compelling idea, and in the right conditions it works exactly as advertised. The trouble is that those conditions are more specific than the story tends to admit, because the question is not whether data can be stored this way. It clearly can. The question is what happens when that data is no longer at rest but is actively queried, updated by applications, and expected to respond under load.


Storage Is Not Execution

Object storage has become the center of gravity for data platforms because it solves real problems. It is durable, scalable, and cost-efficient. Open table formats make it possible to organize that data and share it across tools.

But storage, no matter how well organized, does not behave like a database.

A database is defined by how it executes. How quickly it can answer a query. How it behaves under concurrency. How it handles updates and transactions. How predictable its latency is when the system is under pressure. These are not storage concerns. They emerge from the execution layer, from how the system actually processes data once it is asked to do something with it.

You can store data in Parquet files, layer metadata on top, and build pipelines that continuously feed new data into the system. What you cannot do is change the underlying characteristics of object storage, which is optimized for durability and scalability rather than for low-latency, highly concurrent execution.
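To make that separation concrete, here is a minimal sketch using PyArrow, with local paths standing in for object storage keys and a toy JSON manifest standing in for a real table format's metadata layer. The names, paths, and schema are illustrative assumptions, not a specific product's layout; the point is only that data and metadata are immutable objects, and organizing them does not change how they are served back.

```python
import json
import os

import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("lake/orders", exist_ok=True)

# A batch of rows lands as an immutable Parquet object.
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "amount": [25.0, 14.5, 99.9],
})
pq.write_table(orders, "lake/orders/part-00000.parquet")

# A toy manifest standing in for a table format's metadata layer:
# it records which files make up the table, and nothing more.
manifest = {"table": "orders", "files": ["lake/orders/part-00000.parquet"]}
with open("lake/orders/_manifest.json", "w") as f:
    json.dump(manifest, f)

# Reads resolve the manifest, then fetch whole objects or byte ranges.
# There is no index to probe and no in-place update path: changing one
# row means rewriting a file and publishing new metadata.
with open("lake/orders/_manifest.json") as f:
    files = json.load(f)["files"]
table = pa.concat_tables([pq.read_table(path) for path in files])
print(table.to_pydict())
```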

Where a Storage-First Model Starts to Break

In practice, the system begins to show its trade-offs in fairly consistent ways.

Frequent writes generate large numbers of small files that need continual compaction, while queries must navigate expanding metadata and increasingly fragmented layouts. The system comes to depend on background processes to restore order. Query performance stays tied to how data is physically organized, and as concurrency increases, latency tends to get worse rather than better.
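A hedged sketch of that first failure mode, again using PyArrow on local paths as a stand-in for object storage (the file counts and names are illustrative): frequent small writes pile up as separate objects, and a compaction pass has to rewrite them purely to restore a query-friendly layout.

```python
import glob
import os

import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("lake/events", exist_ok=True)

# Frequent small writes: every micro-batch becomes its own object.
for batch_id in range(100):
    batch = pa.table({"event_id": [batch_id], "value": [batch_id * 0.1]})
    pq.write_table(batch, f"lake/events/part-{batch_id:05d}.parquet")

small_files = sorted(glob.glob("lake/events/part-*.parquet"))
print(f"{len(small_files)} small files for every query to list, open, and plan around")

# The background compaction pass: read the small files, rewrite them as
# one larger object, retire the originals. No new information is added;
# the work exists only to put the data back into a readable shape.
compacted = pa.concat_tables([pq.read_table(path) for path in small_files])
pq.write_table(compacted, "lake/events/part-compacted-00000.parquet")
for path in small_files:
    os.remove(path)
```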

None of this is surprising. It is what happens when execution is layered on top of storage rather than built into it. To compensate, more layers get introduced:

  • Background optimization and compaction jobs keep data organized.

  • Caching layers are added to reduce latency.

  • Separate systems emerge to serve operational workloads and extract value from the data, often moving data out of the lake into downstream databases for applications and dashboards.

At that point, the architecture is no longer defined by the lake itself, but by the execution layers around it. Some of these layers act as workarounds. Others become the system.

Where the Lakehouse Fits

The model works extremely well for large datasets where the primary concern is scale rather than immediate responsiveness. Logs, clickstream data, telemetry, and other append-heavy workloads fit naturally here, as do large collections of unstructured data that benefit from being stored once and accessed by many systems or archived.

In these environments, the lakehouse acts as a coordination layer. It allows data to exist once and be interpreted in multiple ways without constant duplication or movement. The economics are favorable, and the flexibility is real.
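As a rough illustration of that append-heavy pattern, the sketch below writes log events into a date-partitioned Parquet dataset with PyArrow. The schema, paths, and partition column are assumptions for the example rather than a prescribed layout; what matters is that writes only ever add new objects.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A batch of append-only log events, partitioned by day so that engines
# reading the dataset can prune whole directories instead of scanning
# everything.
events = pa.table({
    "ts":      ["2026-04-25T10:00:00Z", "2026-04-25T10:01:00Z", "2026-04-26T09:00:00Z"],
    "day":     ["2026-04-25", "2026-04-25", "2026-04-26"],
    "level":   ["INFO", "WARN", "INFO"],
    "message": ["started", "slow request", "started"],
})

# Each call appends new files under the matching day=... directory;
# existing objects are never modified, which is exactly the write
# pattern object storage handles well.
pq.write_to_dataset(events, root_path="lake/logs", partition_cols=["day"])
```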

Every Lakehouse Needs an Engine

A lakehouse centralizes data, but the execution engine determines how that data behaves when it is used. The engine is where most of the practical differences between systems begin to show up.

An engine designed for interactive workloads does not rely on compaction cycles or background correction to remain performant. It operates directly on data, handling ingestion, queries, and updates in a way that preserves low and predictable latency even as concurrency increases. It handles transactional and analytical workloads as part of the same system rather than forcing them into separate paths.

This is what allows a lakehouse to move beyond being a repository and become something that can support applications, real-time analytics, and AI systems without introducing layers to compensate for limitations.

Without that kind of engine, the lakehouse remains useful, but bounded. It is a place where data can live and be processed in bulk. With it, the same data can be acted on continuously, without waiting for the system to reorganize itself in the background before it becomes usable again.

Making Lakehouses Work

The future is not a single architecture that replaces everything else, and it is not a matter of choosing between lakehouses and databases. It is a matter of understanding what each layer is responsible for.

Open storage provides flexibility, control, and scale. It is a strong foundation and will continue to be adopted for exactly those reasons. Execution determines what is actually possible once the system is in use. It defines latency, concurrency, and how systems behave under real conditions.

The lake enables the architecture. The engine defines the system.

