Real-Time Data Platforms: SingleStore vs. Databricks

SingleStore and Databricks are both exceptional data platforms that address important challenges for their customers.

However, when it comes to performance and cost, SingleStore has several major advantages because it’s built from the ground up for performance, which in turn leads to lower cost. This blog is the first of a multi-part series in which we will examine these differences. We begin with real-time analytics and operations, an area in which SingleStore excels.

Additionally, we have observed that SingleStore also has cost and performance advantages in non-real-time, batch ETL jobs — and we will cover those in a follow-up blog.

Understanding the value of real-time data

To begin, let's establish the significance of real-time data. Why do customers value it? The simple answer is that in many use cases, the value of data diminishes as it ages. Whether you're optimizing a marketing campaign, monitoring trade speeds, pushing real-time inventory updates, observing network hiccups or watching security events, delays in reacting translate to financial losses. The events generated by these sources arrive continuously — in a stream — which has led to the rise of streaming technologies. Databricks' recent blog, "Latency goes subsecond in Apache Spark Structured Streaming," aptly describes this:

“In our conversations with many customers, we have encountered use cases that require consistent sub-second latency. Such low latency use cases arise from applications like operational alerting and real time monitoring, a.k.a ‘operational workloads.’”

At SingleStore, we deal in milliseconds, because that’s what matters to our customers. Let’s call this quality latency, and define it as the time it takes for one event to enter the platform, reach its destination and generate value. There are other important factors to consider, and Databricks correctly points out two more in their blog, which describes “giv[ing] users the flexibility to balance the tradeoff between throughput, cost and latency”. We’ll add two more, simplicity and availability, to complete our goals for the ideal real-time data platform:

  1. Minimize latency
  2. Maximize throughput
  3. Minimize cost
  4. Maximize availability
  5. Maximize simplicity

How SingleStore handles real-time use cases

First, we’d like to discuss SingleStore’s recommended approach to real-time data use cases, which is to ingest streaming data into SingleStore and query it, illustrated in the following figure.

At this point you are probably thinking: huh? That’s it? There must be more to it than that! How could one data platform ingest in real time AND serve analytical queries without sacrificing the real-time SLA? I hear companies talking about adding new, specialized streaming products all the time. What do they do?

How Databricks handles real-time use cases

As it turns out, Databricks is one such company. Let’s examine their approach in their recent blog, "Latency goes subsecond in Apache Spark Structured Streaming," which includes two illustrations. Describing the first illustration, the blog explains:

“Analytical workloads typically ingest, transform, process and analyze data in real time and write the results into Delta Lake backed by object storage” [where it stops being real time]

That’s not the end of the story, as the blog also contains an entirely separate ‘operational workloads’ configuration. While the existence of this configuration is, by itself, compelling evidence that the analytical workloads configuration stops being real time when it reaches Delta Lake, Databricks also pretty much admits this in their blog:

“On the other hand, operational workloads, ingest and process data in real time and automatically trigger a business process.” [that is also in real time]

The curious thing about this second figure is that it ends in a message bus. The data never lands, and nothing ends up using it. Databricks’ solution for real time is to read from Kafka, do transformations and write back to either Kafka or…

“fast key value stores like Apache Cassandra or Redis for downstream integration to business process”

...or other databases! Why would a data platform company like Databricks tell their customers to store data in another database?  Because those databases offer something that Databricks doesn’t: fast point reads and writes (CRUD). They use a key-value format to enable this capability, at the expense of analytical queries, which neither those databases nor Kafka can do easily and efficiently. 

SingleStoreDB, with its patented Universal Storage, can handle both transactional and analytical queries. In fact, SingleStore is more than the sum of Databricks and a key-value store, since it provides a single SQL interface to perform reads and writes with:

  1. High selectivity (OLTP, including CRUD)
  2. Medium selectivity (real-time analytics) — only SingleStore can do this
  3. Low selectivity (large scale analytics and bulk insert)
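To make the selectivity spectrum concrete, here is a hedged sketch of what each class of query looks like in plain SQL. The `orders` table and its columns are our own illustration, not from any benchmark or product documentation:

```sql
-- High selectivity (OLTP, including CRUD): touch one row by key
SELECT * FROM orders WHERE order_id = 12345;
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 12345;

-- Medium selectivity (real-time analytics): aggregate a recent slice
SELECT customer_id, SUM(amount) AS recent_spend
FROM orders
WHERE order_ts > NOW() - INTERVAL 5 MINUTE
GROUP BY customer_id;

-- Low selectivity (large-scale analytics): scan most or all of the table
SELECT region, AVG(amount) AS avg_order
FROM orders
GROUP BY region;
```

The point is that all three run against the same table, through the same SQL interface, without moving data between systems.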

While this is certainly enough to explain why Databricks recommends Cassandra or Redis for real time, there is another compelling reason: SingleStore and those databases are more highly available than Databricks. SingleStore has automatic redundancy within the nodes of its clusters (Standard Edition) and even across availability zones with the push of a button (Premium Edition). Databricks, on the other hand, doesn’t have a page about high availability in its docs. Instead, Databricks talks about how AWS S3, a component of their system, is highly available (which does not mean the whole system is highly available).

The absence of this feature explains the existence of this AWS deployment guide, which describes how, with considerable effort, you can deploy Databricks clusters in two AZs. Note that this still does not make any single cluster cross-AZ; it merely means that some cluster exists in each of two AZs. If you want your Databricks-powered app to be truly tolerant of an AZ failure, you have to do it yourself by configuring the above and changing your app to talk to two clusters, both of which come at the price of a lot more effort, expense and complexity.

With all of this in mind, this illustration of Databricks’ proposal is a more complete representation of their proposed Rube Goldberg Machine — cough, we mean real-time data platform, along with its drawbacks.

Databricks' recommended configuration of operational streaming pipelines can be greatly simplified by replacing all of it with SingleStore, which is built for real time and requires only a single message bus for ingestion.

Option 3: Simple analytical queries, highly available and real time

How SingleStore works under the hood

Wondering how we do it?  We’re glad you asked! Let’s take a deeper dive into the architecture that makes SingleStore a simple and performant platform for real-time analytics.

Streaming data originates from the source, and events are ingested by SingleStore’s Pipelines, which are fully parallelized and can read data from Kafka and a variety of other sources in many popular formats. Another possible source of real-time data is DML statements that insert, update, delete and upsert data. These can run with high throughput, and concurrently with streaming ingest, thanks to row-level locking — which means that individual rows, rather than whole tables, are locked for writes. This greatly increases the throughput of the end-to-end system.
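As a hedged sketch of what this looks like in practice (the topic, table, and column names below are placeholders of our own, not from the original post), a pipeline reading from Kafka is created and started with a couple of statements, and ordinary DML runs alongside it:

```sql
-- Hypothetical table receiving the stream
CREATE TABLE events (
  event_id BIGINT,
  status VARCHAR(16) DEFAULT 'new',
  payload JSON,
  event_ts DATETIME(6),
  SHARD KEY (event_id)
);

-- Parallelized ingest from a Kafka topic (broker and topic are placeholders)
CREATE PIPELINE events_pipeline AS
  LOAD DATA KAFKA 'broker-host:9092/events-topic'
  INTO TABLE events
  FORMAT JSON
  (event_id <- event_id,
   payload <- payload,
   event_ts <- event_ts);

START PIPELINE events_pipeline;

-- Concurrent DML runs alongside streaming ingest thanks to row-level locking
UPDATE events SET status = 'seen' WHERE event_id = 42;
```

Each pipeline partition reads from its Kafka partition in parallel, so ingest throughput scales with the cluster.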

Transformations can be applied with stored procedures, which can serve as the endpoints of pipelines in SingleStore and allow our customers to apply complex transformations to streaming data, including filtering, joins, grouped aggregations and writes into multiple tables. Since they serve as pipeline endpoints, there’s a single, partitioned writer working on batches of data, which facilitates parallelism.

Here’s an example of a stored procedure that maintains a custom running SUM (or AVG) aggregation on grouped data from a pipeline carrying CDC data (where the ‘action’ column may contain ‘DELETED’ and ‘INSERTED’):

CREATE PROCEDURE my_custom_sum (
    cdc QUERY(c1 INT, c2 TEXT, action VARCHAR(16))
)
AS
BEGIN
  -- Net out each batch: deletes subtract, inserts add
  INSERT INTO my_custom_mv
  SELECT c2,
         SUM( IF(action = 'DELETED', -c1, c1) ) AS sum,
         SUM( IF(action = 'DELETED', -1, 1) ) AS num_rows
  FROM cdc
  GROUP BY c2
  HAVING sum != 0 OR num_rows != 0
  ON DUPLICATE KEY UPDATE sum = sum + VALUES(sum),
                          num_rows = num_rows + VALUES(num_rows);
  -- Drop groups whose rows have all been deleted
  DELETE FROM my_custom_mv WHERE num_rows = 0;
END

After it’s transformed, data is written into Tier 1, the memory layer of the LSM tree (the main data structure backing SingleStoreDB tables). These writes use a replicated write-ahead log (WAL) to persist to Tier 2, the local disk; persistence to Tier 3 happens lazily in the background, off the latency-critical path. The net result? The data becomes consistently queryable in single-digit milliseconds.

Key differences between SingleStore and Databricks architecture

Why can’t Databricks offer comparable real time capabilities?  There are two main reasons:

  1. For writes, Tiers 1 + 2 don’t exist
  2. For reads, Tier 1 doesn’t exist and Tier 2 is off by default, harder to use and adds latency

Let’s examine the write path first. In SingleStore, writes arrive in Tier 1, the logs are written to Tier 2 and data is replicated throughout the system and instantly queryable.  Contrast this with Databricks, where writes have to go all the way to the cloud object store before they are acknowledged.

The read path has similar limitations. In SingleStore, Universal Storage takes advantage of both Tiers 1 and 2, and purely in-memory rowstore tables can be used for maximum performance. Compare this with Databricks, which famously stores nothing in its Spark memory layer — which is great, until you want to read really fast.
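For the hottest data, an in-memory rowstore table can be declared directly. A minimal sketch (the table name and columns are hypothetical), with durability still provided by the replicated WAL described above:

```sql
-- ROWSTORE keeps the table entirely in memory (Tier 1);
-- the replicated write-ahead log provides durability
CREATE ROWSTORE TABLE hot_lookups (
  k BIGINT PRIMARY KEY,
  v TEXT
);
```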

Further, Databricks’ disk layer is off by default and even when enabled, new data must first be ingested into the object store and only then pulled into the cache, adding a lot of latency. In SingleStore new data is written to disk on the way in, so it’s already there to be read when you need it.

Most importantly, Databricks knows it’s not possible to write to and read from the cloud object store with low latency — and they have designed their entire streaming architecture as a way to compensate for the absence of this capability.  

Databricks recommends that users split their application into two parts, executed by completely different systems:

  1. Pre-processing data with Spark Structured Streaming pipelines
  2. Lighter weight queries over pre-processed data

However, the first system introduces delays and makes processing less real time, and the second still doesn’t deliver low enough latency for many scenarios.

SingleStore can do fast, low latency queries either over raw ingested data or pre-processed data in stored procedures that are endpoints of ingest pipelines. In the latter case, pre-processing is done in the same environment using SQL. This results in legitimately real-time processing.

Strengths of Databricks

Despite all of the above, streaming architectures that never touch a database do have their uses.  For example, you might have a truly massive amount of data — more than would ever fit in storage — and you just want to make a few transforms to events in one Kafka stream, and re-emit another Kafka stream that triggers an alert. 

Databricks has also made great advances in data exploration, and developers love the flexibility of their notebook interface.  Furthermore, their product has a lot of advanced machine learning capabilities. 

Databricks is also widely used to power ETL jobs, although SingleStore has some performance and cost advantages in this space, so some jobs might make more sense on SingleStore. We will cover this topic and the best ways to use the two products together in a future blog in this series.

Summary: Real-time data platforms: SingleStore vs. Databricks

For real-time use cases, Apache Spark Structured Streaming and another database is an overly complicated and impractical solution when you can simply ingest streaming data into SingleStore and query it.

Lower latency

  • SingleStore has an in-memory data tier for freshly ingested trickle inserts and updates, as well as faster access to metadata. This layer is absent in Databricks 
  • SingleStore has the row-level indexes found in operational systems, plus data formats supporting cheap seeks, while Databricks only supports redundant data structures used to prune read sets at the file level (which SingleStore does as well), not at the row level. This enables SingleStore to use significantly less CPU and disk I/O than Databricks — especially on queries with high and medium selectivity

  • Data in SingleStore can be stored in hybrid row and column-centric representations, a key area of innovation that the company began years ago with Universal Storage and we recently extended with Column Group Indexes. This also allows SingleStore to save on disk I/O and CPU compared to Databricks —  especially on queries that select all or most of the columns in a table
  • Writes to SingleStore become consistently queryable in single-digit milliseconds thanks to the in-memory tier and write-ahead logging (WAL); compare this to a pipeline that terminates in a Delta table, which is backed by an object store. Each blob write to S3 could take up to 100 ms, there are likely multiple blob writes for each update, and that’s after the data has been translated to Parquet — another step not needed in SingleStore on the latency-critical code path. End to end, this means writes to Databricks will be one to two orders of magnitude slower than SingleStore
  • Add up all the preceding advantages, and it’s not surprising that SingleStore queries are exceptionally fast compared to Databricks, as you can see in this TPC-H benchmark

More throughput

  • There are two key factors that influence throughput, the most important being latency. If a SingleStore query takes 10 ms, and the same query on a similarly sized Databricks cluster takes 1 second then, all other things being equal, SingleStore will have 100x the throughput of Databricks. See the section above for details on SingleStore’s latency advantage
  • The other factor is concurrency. A system in which queries interfere with each other will have lower throughput — again, with all other things being equal. SingleStore has advantages over Databricks in this regard as well. For example, SingleStore has row-level locking by default; compare this to the equivalent write-conflict handling in Databricks, which only operates at the table level (except in a few heavily caveated cases only available in preview). This type of feature is much harder for Databricks because anyone can write to their open tables at any time, which means they have to add a lot of additional steps to avoid write conflicts

  • The most popular benchmark to test throughput is derived from TPC-C, which delivers its results in “transactions per minute”. We’ve published SingleStore’s performance on TPC-C, and as far as we can tell, Databricks has never done the same and neither have any other third parties

More cost effective

  • To meet the same real-time SLA as SingleStoreDB, Databricks requires an extra database and an extra messaging bus. And whether you choose open-source software or a managed solution, you are going to end up paying more either way because the former takes more employees and the latter costs money
  • SingleStore can often execute the same query 10x - 100x faster than Databricks (see latency section), and SingleStore has better concurrency (see throughput section). Since no amount of money will let Databricks match SingleStore latency, throughput can only be matched if Databricks users scale up and spend a lot more money to achieve the same result.  Net / net, CSPs charge by the hour, and if you can make your job take way less time, it will cost you way less money

More available

  • Databricks can’t serve applications and use cases that need RPO = 0 and very low RTO, because it lacks high-availability features such as replication, cross-AZ failover, two hot copies of the data always ready for querying, and incremental backups

Much simpler

  • SingleStore is more real time. If an aggregate on streaming data has a windowing function with a 5-second or 1-minute window, SingleStore will surface the data immediately on a partial time window in the next query.  Contrast this with Databricks users computing a result of an aggregation in a streaming pipeline — they will only see the result of the aggregation once the window ends and the result is inserted into a database 

  • We won’t force you to reason about joining streams — joining tables is much easier to reason about
  • You won’t need to worry about late arriving data. If some events are late, the next query will reflect changes made in the past in the event timeline

  • We support exactly once, so we won’t lose your data — unlike Databricks, where “Exactly once end-to-end processing will not be supported.”
  • Pipelines ending in stored procedures can perform transformations and maintain running aggregates
  • SingleStore supports read-modify-write so the final use case can be simpler, without the need to stick to a pure event-based programming and data modeling paradigm
  • SingleStore can store and execute code in notebooks or stored procedures, whereas Databricks only has notebooks
  • And finally, at the risk of repeating ourselves (but it bears repeating): no extra databases are needed
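The windowing and late-data points above can be illustrated with a hedged sketch (the `events` table, its columns and the bucketing expression are our own placeholders). Because data lands in a queryable table immediately, an ordinary GROUP BY over time buckets returns the still-open window on every query, and late-arriving rows simply update their past bucket the next time you query:

```sql
-- 1-minute buckets over raw events; the current, partially filled
-- bucket appears as soon as its first event arrives, and a late
-- event is reflected in its (past) bucket on the next query
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(event_ts) DIV 60 * 60) AS minute_bucket,
       COUNT(*) AS events,
       SUM(amount) AS total
FROM events
GROUP BY minute_bucket
ORDER BY minute_bucket DESC;
```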

To put it simply, SingleStore’s queries are so efficient and reliably fast that we can support high concurrency and, combined with our high availability, even power applications. This is why companies like LiveRamp and Outreach (which also use Databricks) trust SingleStore to power their mission-critical, real-time analytics workloads.

Here’s a table to help you keep track of everything we’ve discussed:

| Capability | Databricks | SingleStoreDB |
| --- | --- | --- |
| Storage layers | 2 (only 1 automatic) | 3 |
| Ingest layer | Object Store (high latency) | Local Disk with replication (low latency) |
| Products needed for streaming | Databricks + another db | One; only SingleStoreDB |
| TPC-H SF-10 Benchmark | 58.4 seconds | 33.2 seconds |
| TPC-C Benchmark | Unavailable | 12,545 |
| Can serve low RPO / RTO applications | No | Yes |
| Can transform streaming data | Yes (structured streaming) | Yes (pipelines -> stored proc) |
| Exactly-once supported | No | Yes |
| Easy relational queries | Not in structured streaming | Yes |
| Best solution for data exploration and machine learning | Yes | No |
| Best solution for real-time analytics, operations, and applications | No | Yes |

Stick around for part 2 of this series, in which we will share more details about the best ways to use SingleStore and Databricks together, and about SingleStore’s performance and cost advantages in the non-real-time, batch ETL space.

