Real-Time Data Platforms: SingleStore vs. Databricks

SingleStore and Databricks are both exceptional data platforms that address important challenges for their customers.

However, when it comes to performance and cost, SingleStore has several major advantages because it’s built from the ground up for performance, which in turn leads to lower cost. This blog is the first of a multi-part series in which we will examine these differences. We begin with real-time analytics and operations, an area in which SingleStore excels.

Additionally, we have observed that SingleStore also has cost and performance advantages in non-real-time, batch ETL jobs — and we will cover those in a follow-up blog.

Understanding the value of real-time data

To begin, let's establish the significance of real-time data. Why do customers value it? The simple answer is that in many use cases, the value of data diminishes as it ages. Whether you're optimizing a marketing campaign, monitoring trade speeds, pushing real-time inventory updates, observing network hiccups or watching security events, delays in reacting translate to financial losses. The events generated by these sources arrive continuously — in a stream — which has led to the rise of streaming technologies. Databricks' recent blog, "Latency goes subsecond in Apache Spark Structured Streaming," aptly describes this:

“In our conversations with many customers, we have encountered use cases that require consistent sub-second latency. Such low latency use cases arise from applications like operational alerting and real time monitoring, a.k.a ‘operational workloads.’”

At SingleStore, we deal in milliseconds, because that’s what matters to our customers. Let’s call this quality latency, and define it as the time it takes for one event to enter the platform, reach its destination and generate value. There are other important factors to consider, and Databricks correctly points out two more in their blog, which describes “giv[ing] users the flexibility to balance the tradeoff between throughput, cost and latency”. We’ll add two more, simplicity and availability, to complete our goals for the ideal real-time data platform:

  1. Minimize latency
  2. Maximize throughput
  3. Minimize cost
  4. Maximize availability
  5. Maximize simplicity

How SingleStore handles real-time use cases

First, we’d like to discuss SingleStore’s recommended approach to real-time data use cases, which is to ingest streaming data into SingleStore and query it, illustrated in the following figure.

At this point you are probably thinking: huh? That’s it? There must be more to it than that! How could one data platform ingest in real time AND serve analytical queries without sacrificing the real-time SLA? I hear companies talking about adding new, specialized streaming products all the time. What do they do?

How Databricks handles real-time use cases

As it turns out, Databricks is one such company. Let’s examine their approach in their recent blog, "Latency goes subsecond in Apache Spark Structured Streaming," which includes two illustrations. Describing the first illustration, the blog explains:

“Analytical workloads typically ingest, transform, process and analyze data in real time and write the results into Delta Lake backed by object storage” [where it stops being real time]

That’s not the end of the story, as the blog also contains an entirely separate ‘operational workloads’ configuration. While the existence of this configuration is, by itself, compelling evidence that the analytical workloads configuration stops being real time when it reaches Delta Lake, Databricks also pretty much admits this in their blog:

“On the other hand, operational workloads, ingest and process data in real time and automatically trigger a business process.” [that is also in real time]

The curious thing about this second figure is that it ends in a message bus. The data never lands, and nothing ends up using it. Databricks’ solution for real time is to read from Kafka, do transformations and write back to either Kafka or…

“fast key value stores like Apache Cassandra or Redis for downstream integration to business process”

...or other databases! Why would a data platform company like Databricks tell their customers to store data in another database?  Because those databases offer something that Databricks doesn’t: fast point reads and writes (CRUD). They use a key-value format to enable this capability, at the expense of analytical queries, which neither those databases nor Kafka can do easily and efficiently. 

SingleStoreDB, with its patented Universal Storage, can handle both transactional and analytical queries. In fact, SingleStore is more than the sum of Databricks and a key-value store, since it provides a single SQL interface to perform reads and writes with:

  1. High selectivity (OLTP, including CRUD)
  2. Medium selectivity (real-time analytics) — only SingleStore can do this
  3. Low selectivity (large scale analytics and bulk insert)
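To make the selectivity spectrum concrete, here is a hedged sketch of what each class of query looks like in plain SQL. The `orders` table and its columns are our own illustration, not from any benchmark or product documentation:

```sql
-- High selectivity (OLTP, including CRUD): touch one row by key
SELECT * FROM orders WHERE order_id = 12345;
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 12345;

-- Medium selectivity (real-time analytics): aggregate a recent slice
SELECT customer_id, SUM(amount) AS recent_spend
FROM orders
WHERE order_ts > NOW() - INTERVAL 5 MINUTE
GROUP BY customer_id;

-- Low selectivity (large-scale analytics): scan most or all of the table
SELECT region, AVG(amount) AS avg_order
FROM orders
GROUP BY region;
```

The point is that all three run against the same table, through the same SQL interface, without moving data between systems.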

While this is certainly enough to explain why Databricks recommends Cassandra or Redis for real time, there is another compelling reason: SingleStore and those databases are more highly available than Databricks. SingleStore has automatic redundancy within the nodes of its clusters (Standard Edition) and even across availability zones with the push of a button (Premium Edition). Databricks, on the other hand, doesn’t have a page about high availability in its docs. Instead, Databricks talks about how AWS S3, a component of their system, is highly available (which does not mean the whole system is highly available).

The absence of this feature explains the existence of this AWS deployment guide, which describes how, with considerable effort, you can deploy Databricks clusters in two AZs. Note that this still does not make any single cluster cross-AZ; it merely means that some cluster exists in each of two AZs. If you want your Databricks-powered app to be truly tolerant of an AZ failure, you have to do it yourself by configuring the above and changing your app to talk to two clusters, both of which come at the price of a lot more effort, expense and complexity.

With all of this in mind, this illustration of Databricks’ proposal is a more complete representation of their proposed Rube Goldberg Machine — cough, we mean real-time data platform, along with its drawbacks.

Databricks' recommended configuration of operational streaming pipelines can be greatly simplified by replacing all of it with SingleStore, which is built for real time and requires only a single message bus for ingestion.

Option 3: Simple analytical queries, highly available and real time

How SingleStore works under the hood

Wondering how we do it?  We’re glad you asked! Let’s take a deeper dive into the architecture that makes SingleStore a simple and performant platform for real-time analytics.

Streaming data originates from the source, and events are ingested by SingleStore’s Pipelines, which are fully parallelized and can read data from Kafka and a variety of other sources in many popular formats. Another possible source of real-time data is DML statements that insert, update, delete and upsert data. These can run with high throughput, and concurrently with streaming ingest, thanks to row-level locking — which means that individual rows, rather than whole tables, are locked for writes. This greatly increases the throughput of the end-to-end system.
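As a hedged sketch of what this looks like in practice (the topic, table, and column names below are placeholders of our own, not from the original post), a pipeline reading from Kafka is created and started with a couple of statements, and ordinary DML runs alongside it:

```sql
-- Hypothetical table receiving the stream
CREATE TABLE events (
  event_id BIGINT,
  status VARCHAR(16) DEFAULT 'new',
  payload JSON,
  event_ts DATETIME(6),
  SHARD KEY (event_id)
);

-- Parallelized ingest from a Kafka topic (broker and topic are placeholders)
CREATE PIPELINE events_pipeline AS
  LOAD DATA KAFKA 'broker-host:9092/events-topic'
  INTO TABLE events
  FORMAT JSON
  (event_id <- event_id,
   payload <- payload,
   event_ts <- event_ts);

START PIPELINE events_pipeline;

-- Concurrent DML runs alongside streaming ingest thanks to row-level locking
UPDATE events SET status = 'seen' WHERE event_id = 42;
```

Each pipeline partition reads from its Kafka partition in parallel, so ingest throughput scales with the cluster.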

Transformations can be applied with stored procedures, which can serve as the endpoints of pipelines in SingleStore and allow our customers to apply complex transformations to streaming data, including filtering, joins, grouped aggregations and writes into multiple tables. Since they serve as pipeline endpoints, there’s a single, partitioned writer working on batches of data, which facilitates parallelism.

Here’s an example of a stored procedure that maintains a custom running SUM (or AVG) aggregation on grouped data from a pipeline carrying CDC data (where the ‘action’ column may contain ‘DELETED’ and ‘INSERTED’):

CREATE PROCEDURE my_custom_sum (
    cdc QUERY(c1 INT, c2 TEXT, action VARCHAR(16))
)
AS
BEGIN
  -- Net out each batch: deletes subtract, inserts add
  INSERT INTO my_custom_mv
  SELECT c2,
         SUM( IF(action = 'DELETED', -c1, c1) ) AS sum,
         SUM( IF(action = 'DELETED', -1, 1) ) AS num_rows
  FROM cdc
  GROUP BY c2
  HAVING sum != 0 OR num_rows != 0
  ON DUPLICATE KEY UPDATE sum = sum + VALUES(sum),
                          num_rows = num_rows + VALUES(num_rows);
  -- Drop groups whose rows have all been deleted
  DELETE FROM my_custom_mv WHERE num_rows = 0;
END

After it’s transformed, data is written into Tier 1, the memory layer of the LSM tree (the main data structure backing SingleStoreDB tables). These writes use a replicated write-ahead log (WAL) to persist to Tier 2, the local disk; persistence to Tier 3 happens lazily in the background, off the latency-critical path. The net result? The data becomes consistently queryable in single-digit milliseconds.

Key differences between SingleStore and Databricks architecture

Why can’t Databricks offer comparable real time capabilities?  There are two main reasons:

  1. For writes, Tiers 1 + 2 don’t exist
  2. For reads, Tier 1 doesn’t exist and Tier 2 is off by default, harder to use and adds latency

Let’s examine the write path first. In SingleStore, writes arrive in Tier 1, the logs are written to Tier 2 and data is replicated throughout the system and instantly queryable.  Contrast this with Databricks, where writes have to go all the way to the cloud object store before they are acknowledged.

The read path has similar limitations. In SingleStore, Universal Storage takes advantage of both Tiers 1 and 2, and purely in-memory rowstore tables can be used for maximum performance. Compare this with Databricks, which famously stores nothing in its Spark memory layer — which is great, until you want to read really fast.
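For the hottest data, an in-memory rowstore table can be declared directly. A minimal sketch (the table name and columns are hypothetical), with durability still provided by the replicated WAL described above:

```sql
-- ROWSTORE keeps the table entirely in memory (Tier 1);
-- the replicated write-ahead log provides durability
CREATE ROWSTORE TABLE hot_lookups (
  k BIGINT PRIMARY KEY,
  v TEXT
);
```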

Further, Databricks’ disk layer is off by default and even when enabled, new data must first be ingested into the object store and only then pulled into the cache, adding a lot of latency. In SingleStore new data is written to disk on the way in, so it’s already there to be read when you need it.

Most importantly, Databricks knows it’s not possible to write to and read from the cloud object store with low latency — and they have designed their entire streaming architecture as a way to compensate for the absence of this capability.  

Databricks recommends that users split their application into two parts, executed by completely different systems:

  1. Pre-processing data with Spark Structured Streaming pipelines
  2. Lighter weight queries over pre-processed data

However, the first system introduces delays and makes processing less real time, and the second still doesn’t deliver low enough latency for many scenarios.

SingleStore can do fast, low latency queries either over raw ingested data or pre-processed data in stored procedures that are endpoints of ingest pipelines. In the latter case, pre-processing is done in the same environment using SQL. This results in legitimately real-time processing.

Strengths of Databricks

Despite all of the above, streaming architectures that never touch a database do have their uses.  For example, you might have a truly massive amount of data — more than would ever fit in storage — and you just want to make a few transforms to events in one Kafka stream, and re-emit another Kafka stream that triggers an alert. 

Databricks has also made great advances in data exploration, and developers love the flexibility of their notebook interface.  Furthermore, their product has a lot of advanced machine learning capabilities. 

Databricks is also widely used to power ETL jobs, although SingleStore has some performance and cost advantages in this space, so some jobs might make more sense on SingleStore. We will cover this topic and the best ways to use the two products together in a future blog in this series.

Summary: Real-time data platforms: SingleStore vs. Databricks

For real-time use cases, Apache Spark Structured Streaming and another database is an overly complicated and impractical solution when you can simply ingest streaming data into SingleStore and query it.

Lower latency

  • SingleStore has an in-memory data tier for freshly ingested trickle inserts and updates, as well as faster access to metadata. This layer is absent in Databricks 
  • SingleStore has the row-level indexes found in operational systems, plus data formats supporting cheap seeks, while Databricks only supports redundant data structures used to prune read sets at the file level (which SingleStore does as well), not at the row level. This enables SingleStore to use significantly less CPU and disk I/O than Databricks — especially on queries with high and medium selectivity

  • Data in SingleStore can be stored in hybrid row and column-centric representations, a key area of innovation that the company began years ago with Universal Storage and we recently extended with Column Group Indexes. This also allows SingleStore to save on disk I/O and CPU compared to Databricks —  especially on queries that select all or most of the columns in a table
  • Writes to SingleStore become consistently queryable in single-digit milliseconds thanks to the in-memory tier and write-ahead logging (WAL); compare this to a pipeline that terminates in a Delta table, which is backed by an object store. Each blob write to S3 could take up to 100 ms, there are likely multiple blob writes for each update, and that’s after the data has been translated to Parquet — another step not needed in SingleStore on the latency-critical code path. End to end, this means writes to Databricks will be one to two orders of magnitude slower than SingleStore
  • Add up all the preceding advantages, and it’s not surprising that SingleStore queries are exceptionally fast compared to Databricks, as you can see in this TPC-H benchmark

More throughput

  • There are two key factors that influence throughput, the most important being latency. If a SingleStore query takes 10 ms, and the same query on a similarly sized Databricks cluster takes 1 second then, all other things being equal, SingleStore will have 100x the throughput of Databricks. See the section above for details on SingleStore’s latency advantage
  • The other factor is concurrency. A system in which queries interfere with each other will have lower throughput — again, with all other things being equal. SingleStore has advantages over Databricks in this regard as well. For example, SingleStore has row-level locking by default; compare this to the equivalent write-conflict handling in Databricks, which only operates at the table level (except in a few heavily caveated cases only available in preview). This type of feature is much harder for Databricks because anyone can write to their open tables at any time, which means they have to add a lot of additional steps to avoid write conflicts

  • The most popular benchmark to test throughput is derived from TPC-C, which delivers its results in “transactions per minute”. We’ve published SingleStore’s performance on TPC-C, and as far as we can tell, Databricks has never done the same and neither have any other third parties

More cost effective

  • To meet the same real-time SLA as SingleStoreDB, Databricks requires an extra database and an extra messaging bus. And whether you choose open-source software or a managed solution, you are going to end up paying more either way because the former takes more employees and the latter costs money
  • SingleStore can often execute the same query 10x - 100x faster than Databricks (see latency section), and SingleStore has better concurrency (see throughput section). Since no amount of money will let Databricks match SingleStore latency, throughput can only be matched if Databricks users scale up and spend a lot more money to achieve the same result.  Net / net, CSPs charge by the hour, and if you can make your job take way less time, it will cost you way less money

More available

  • Databricks can’t serve applications and use cases that need RPO = 0 and very low RTO, because it lacks high-availability features such as replication, cross-AZ failover, two hot copies of the data always ready for querying, and incremental backups

Much simpler

  • SingleStore is more real time. If an aggregate on streaming data has a windowing function with a 5-second or 1-minute window, SingleStore will surface the data immediately on a partial time window in the next query.  Contrast this with Databricks users computing a result of an aggregation in a streaming pipeline — they will only see the result of the aggregation once the window ends and the result is inserted into a database 

  • We won’t force you to reason about joining streams — joining tables is much easier to reason about
  • You won’t need to worry about late arriving data. If some events are late, the next query will reflect changes made in the past in the event timeline

  • We support exactly once, so we won’t lose your data — unlike Databricks, where “Exactly once end-to-end processing will not be supported.”
  • Pipelines ending in stored procedures can perform transformations and maintain running aggregates
  • SingleStore supports read-modify-write so the final use case can be simpler, without the need to stick to a pure event-based programming and data modeling paradigm
  • SingleStore can store and execute code in notebooks or stored procedures, whereas Databricks only has notebooks
  • And finally, at the risk of repeating ourselves (but it bears repeating): no extra databases are needed
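The windowing and late-data points above can be illustrated with a hedged sketch (the `events` table, its columns and the bucketing expression are our own placeholders). Because data lands in a queryable table immediately, an ordinary GROUP BY over time buckets returns the still-open window on every query, and late-arriving rows simply update their past bucket the next time you query:

```sql
-- 1-minute buckets over raw events; the current, partially filled
-- bucket appears as soon as its first event arrives, and a late
-- event is reflected in its (past) bucket on the next query
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(event_ts) DIV 60 * 60) AS minute_bucket,
       COUNT(*) AS events,
       SUM(amount) AS total
FROM events
GROUP BY minute_bucket
ORDER BY minute_bucket DESC;
```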

To put it simply, SingleStore’s queries are so efficient and reliably fast that we can support high concurrency and, combined with our high availability, even power applications. This is why companies like LiveRamp and Outreach (which also use Databricks) trust SingleStore to power their mission-critical, real-time analytics workloads.

Here’s a table to help you keep track of everything we’ve discussed:

| Capability | Databricks | SingleStoreDB |
| --- | --- | --- |
| Storage layers | 2 (only 1 automatic) | 3 |
| Ingest layer | Object Store (high latency) | Local Disk with replication (low latency) |
| Products needed for streaming | Databricks + another db | One; only SingleStoreDB |
| TPC-H SF-10 Benchmark | 58.4 seconds | 33.2 seconds |
| TPC-C Benchmark | Unavailable | 12,545 |
| Can serve low RPO / RTO applications | No | Yes |
| Can transform streaming data | Yes (structured streaming) | Yes (pipelines -> stored proc) |
| Exactly-once supported | No | Yes |
| Easy relational queries | Not in structured streaming | Yes |
| Best solution for data exploration and machine learning | Yes | No |
| Best solution for real-time analytics, operations, and applications | No | Yes |

Stick around for part 2 of this series, in which we will share more details about the best ways to use SingleStore and Databricks together, and about SingleStore’s performance and cost advantages in the non-real-time, batch ETL space.

