The TPC-DS Benchmarking Showdown

This opinion piece from Adam Prout, CTO and Co-Founder of SingleStore, discusses the relevance of TPC-DS benchmark results in today's database market. Do we really need it? Can we do better?

The TPC-DS Benchmarking Showdown - A SingleStore POV

The recent TPC-DS benchmarking results from Snowflake and Databricks have made for great drama. Amid the accusations of lack of integrity and poor methodology, I think one important question hasn't been thoughtfully discussed: should database vendors be running the classic TPC-H or TPC-DS analytical benchmarks at all in 2021? These benchmarks have been around now for 20 years. They were designed in an era when 50 GB hard drives were big, 75 IOPS was a lot, and most databases were single-host systems.

Surely there is a better way to compare databases' analytical capabilities today? To be honest, the answer is: not really. These benchmarks are well known, and you can find schemas and results for many different databases online. The TPC council (http://tpc.org/) acts as an independent third party that verifies and publishes official results.

That said, I do think there is some missing context in the recent spat between Snowflake and Databricks. Today, TPC-DS and TPC-H are really table-stakes benchmarks for data warehouses. All popular analytical databases (Redshift, Snowflake, Azure Synapse) should have roughly the same performance on these benchmarks at most scale factors. The key techniques needed to build a database that can run them competitively are well documented in the literature[1][2][3][4], including:

how to do fast scans and filters on a columnstore with vectorized execution and segment elimination
how to build a distributed query processor and query optimizer that can decompose complex queries to minimize data movement and data scanned
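
To make the first technique concrete, here is a minimal sketch of segment elimination on a single column. It is an illustration only, not how SingleStore or any other product implements it. The names Segment, build_segments and scan_range are hypothetical, and the segment size and data are made up. The idea is that each segment stores the minimum and maximum of its values, so a range filter can skip any segment whose range cannot contain a match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """A chunk of one column plus min/max metadata (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def build_segments(column: List[int], segment_size: int) -> List[Segment]:
    """Split a column into fixed-size segments and record each one's min/max."""
    segments = []
    for start in range(0, len(column), segment_size):
        chunk = column[start:start + segment_size]
        segments.append(Segment(chunk, min(chunk), max(chunk)))
    return segments


def scan_range(segments: List[Segment], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and the number of segments actually scanned."""
    matches: List[int] = []
    scanned = 0
    for seg in segments:
        # Segment elimination: skip segments whose range cannot overlap the filter.
        if seg.max_value < low or seg.min_value > high:
            continue
        scanned += 1
        # Filter the surviving segment as one batch of values.
        matches.extend(v for v in seg.values if low <= v <= high)
    return matches, scanned


# Made-up data: a sorted column of 1,000 values in 10 segments of 100.
column = list(range(1000))
segments = build_segments(column, 100)
matches, scanned = scan_range(segments, 250, 320)
print(f"scanned {scanned} of {len(segments)} segments, {len(matches)} matching rows")
```

The benefit depends on how well the data is clustered on the filtered column: on sorted or nearly sorted data most segments are skipped, while on randomly ordered data nearly every segment overlaps the filter and must still be scanned.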

The engineering team at SingleStore regularly runs these benchmarks against ourselves and the competition as part of our release process to validate our performance and avoid regressions. Our testing over the past few years has validated this claim: most results for mainstream SQL data warehouses are within 10-20% of each other. This is not enough of a difference to provide any competitive advantage for classical DW workloads, such as a bulk data load followed by a reasonably low-throughput workload of complex read queries.

So, why is Databricks so excited about their TPC-DS results? Well, they have finally caught up with the table-stakes features of a SQL data warehouse and now have benchmark results to prove it. This is a good accomplishment and represents years of work for them, but it's not a grand breakthrough in SQL data warehousing. It's another system that has reached the minimum requirements to be considered for SQL DW workloads.

At SingleStore, we believe good performance on analytical queries and benchmarks is only one aspect of a great database. SingleStore is a more general-purpose relational database than a data warehouse. It combines good analytical capabilities with support for the high-throughput, low-latency (single-digit millisecond) reads and writes (including updates) required for more operational workloads. This gives developers added flexibility to run transactional and analytical workloads over the same data using the same database. And for Chief Data Officers, SingleStore provides a unifying data layer to simplify their data architecture across workloads and storage types. Support for JSON, full-text search and geospatial data often enables SingleStore to replace two or three special-purpose databases such as Elasticsearch, Redis, Cassandra, and others, saving time and money and reducing data infrastructure sprawl. To cite just one recent example, Fathom Analytics replaced AWS DynamoDB, Redis, MySQL, and Elasticsearch with just SingleStore to power their real-time, interactive marketing analytics platform.

If you have questions about SingleStore's capabilities and how the world's fastest modern database can meet your database requirements, ask them on our Forum. Development experts and engineers from SingleStore, as well as members of our user community, are always happy to help out.

[1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf

[2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf

[3] https://web.stanford.edu/class/cs245/readings/c-store.pdf

[4] http://sites.computer.org/debull/A12mar/vectorwise.pdf