SingleStore is a state-of-the-art, distributed, scale-out transaction processing and query execution engine. In the same system, we can run kickass large-scale analytical queries and ultra-low-latency key-value workloads. With the release of SingleStoreDB Self-Managed 7.0 at the end of 2019, our new replication protocol (something I am personally very proud of) greatly improved the durability of data stored in SingleStore clusters. I don’t think there’s any other system in the world that can do what we can do, and our TPC-H, TPC-DS, and TPC-C benchmark results prove it.
Yet, there’s an awkwardness. Compared with some of our competitors, who couldn’t dream of tackling the workloads we do, SingleStore doesn’t “just work”. Yes, we work amazingly at scale, but actually scaling is a delicate and manual process. Yes, our durability is rock-solid, but users still have to take and manage their own backups. Yes, our integrated storage and compute gives incredible (and predictable) query performance, but rebalancing can become awkward when it requires moving hundreds of terabytes of data that might never be queried.
So the question we are left with is this: can we solve all of these problems in a way that doesn’t sacrifice the things that make SingleStore SingleStore? Can we build it with tight integration with our existing best-in-the-world clustered storage, and use it to simultaneously make SingleStore more suitable as a system of record, and also provide the perfect out-of-the-box cloud experience? Can we draw inspiration from the best ideas in Aurora, Snowflake, and BigQuery, but do it the SingleStore way? And can we do it in a way that works on any cloud, or on-prem?
This is what we hope to achieve with Bottomless.
What is Bottomless?
Bottomless is the separation of storage and compute for SingleStore. Basically, data files (snapshots, logs and blobs) are persisted to S3 (or comparable blob storage, or even NFS) asynchronously. The “blobs” I’m talking about are the compressed, encoded data structures backing our columnstore. We still maintain high availability within the SingleStore cluster for the most recent data, so the system remains available in case of a failure, but long-term storage moves to blob storage. In this world, blobs which aren’t being queried can be deleted from a SingleStore node’s local disk, allowing the cluster to hold more data than available disk, thus making the cluster’s storage “bottomless”. In fact, new replicas don’t need to download all blob files in order to come online, so creating and moving partitions (which are slices of each table’s data) becomes very cheap.
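To make the core invariant concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the class names `BlobStore` and `LocalCache`, the flush and eviction methods); it is not SingleStore’s implementation, just an illustration of the rule that a local data file becomes deletable only once its remote copy exists.

```python
class BlobStore:
    """Toy stand-in for S3/NFS: maps blob name -> bytes.
    All names here are illustrative, not a SingleStore API."""
    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = data

    def get(self, name):
        return self._objects[name]

    def has(self, name):
        return name in self._objects


class LocalCache:
    """A node's local-disk view of its data files. A file is evictable
    only once the remote copy exists, so deleting it never loses data:
    that is the 'bottomless' invariant."""
    def __init__(self, remote):
        self.remote = remote
        self.files = {}          # name -> bytes (simulating local disk)
        self.persisted = set()   # names known to be durable remotely

    def write(self, name, data):
        self.files[name] = data  # local write is immediately visible

    def flush_async(self):
        # In the real system this is a background upload; inline here.
        for name, data in self.files.items():
            if name not in self.persisted:
                self.remote.put(name, data)
                self.persisted.add(name)

    def evict(self, name):
        # Safe only after the blob has landed in remote storage.
        if name in self.persisted:
            del self.files[name]
            return True
        return False

    def read(self, name):
        # A cache miss falls back to remote storage and re-populates.
        if name in self.files:
            return self.files[name]
        data = self.remote.get(name)
        self.files[name] = data
        return data
```

Because `read` transparently refills from remote storage after an eviction, the local disk is just a cache: the cluster can address far more data than it can hold locally.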
The extra durability coming from all data eventually landing in blob storage is game-changing, making the way we used to think about backups and disaster recovery obsolete. It’s essentially “continuous backup”. This will obviously help us address larger petabyte-sized datasets for historical analytics. But, we don’t intend to do this only to support analytic workloads. Since we’re SingleStore, we’ll bring Bottomless storage to durable transactional, low-latency workloads without compromising performance.
Much More Flexible Clustering
Not requiring all the blobs to be locally on disk means provisioning or moving a replica can be very fast, on the order of minutes. This can give you faster rebalancing, allowing for a few interesting use cases.
Scaling up and down: If you need to burst your workload in the cloud, you could increase the number of leaves and rebalance onto them while the workload runs. When the burst is over, you can more densely pack the partitions and save on the CPUs and memory you aren’t using to run queries.
Adding read replicas: You could quickly provision a set of read-only replicas to run a query workload against, and discard them when you’re done.
Low recovery time objective (RTO): Restoring doesn’t require downloading blobs up front; a database can be restored and available for queries very quickly in the event of a disaster.
Read-only point-in-time recovery (PITR): In a Bottomless storage world, it is possible to quickly spin up a new set of replicas fixed at a point in time in the past, and discard these extra replicas when you’re done.
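The last two bullets share one mechanism, which can be sketched as follows. Restore only needs the (small) snapshot and the log tail up front, with an optional version cut-off for point-in-time restores; the columnstore blobs can arrive lazily afterwards. The function name, key layout, and log format below are all assumptions for illustration, not SingleStore internals.

```python
def restore_partition(remote, partition_id, as_of=None):
    """Hypothetical restore: fetch the snapshot, replay the log tail,
    and become queryable -- blob downloads happen in the background.
    `remote` is any mapping from key to object."""
    # Snapshot is small: metadata plus recent rowstore data.
    snapshot = remote.get(f"{partition_id}/snapshot")
    state = dict(snapshot)
    # Replay the log tail; each entry is (version, key, value).
    for version, key, value in remote.get(f"{partition_id}/log"):
        if as_of is not None and version > as_of:
            break  # point-in-time cut-off: ignore later history
        state[key] = value
    return state  # queryable now; no blobs were downloaded
```

Passing `as_of=None` gives the low-RTO disaster-recovery restore at the tip of history; passing a past version gives the read-only PITR replica.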
Bottomless is Backup-Theory-Of-Everything
All the data eventually goes to blob storage, so this, by definition, is a continuous backup. Using a forthcoming Global Versioning scheme (it’s not released yet, but it’s similar to Clock-SI, with some SingleStore-specific innovations), we will have the ability to recover to a consistent point at any time in the past (or at the tip of history). Taking explicit daily backups, then, isn’t really a thing: you are always taking backups. If you want an off-site backup, simply replicate your remote storage (or, if it’s S3, just accept its 11 nines of durability and enjoy).
So in this world:
- Incremental backup
- Continuous backup
- Disaster recovery
- Point-in-time recovery (PITR)
are all the same feature, and it’s always on at virtually no additional overhead. Of course, if you really want to take an explicit backup, you can, but now it can be taken instantaneously: simply record the current Global Version.
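Here is a toy sketch of why recording a version is enough. Assuming every write carries a monotonically increasing global version and the log is continuously persisted, an “explicit backup” is just a named pointer to a version number. The class and method names are hypothetical, not the unreleased Global Versioning design itself.

```python
class BackupCatalog:
    """Illustrative sketch: backup = a name bound to a global version.
    No data is copied at backup time."""
    def __init__(self):
        self.current_version = 0
        self.log = []        # (version, key, value) -- continuously persisted
        self.backups = {}    # backup name -> version

    def write(self, key, value):
        self.current_version += 1
        self.log.append((self.current_version, key, value))

    def take_backup(self, name):
        # Instantaneous: just record the current version.
        self.backups[name] = self.current_version

    def restore(self, name):
        # Replay the log up to the recorded version.
        cutoff = self.backups[name]
        state = {}
        for version, key, value in self.log:
            if version > cutoff:
                break
            state[key] = value
        return state
```

Since `restore` can take any version, not just a recorded one, the same machinery also yields PITR to arbitrary past points, which is why all four bullets above collapse into one feature.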
A Bottomless experience means that data files can be deleted from local disk once they are persisted to blob storage, at least if they aren’t being queried. This means the size of your local disk doesn’t determine how much data you can store. The user can insert and insert and insert and never delete, or have a very long retention period. Since our data is stored in a columnstore format, this can happen at the per-column level: columns that aren’t being queried need not be stored locally! And read replicas querying only a subset of the data need only store that subset.
But the system has to be a little bit smart. A newly provisioned replica will be “queryable” in minutes, but those queries will be slow, so the new replica should probably start downloading “popular” blobs immediately in the background. Further, a replica that might be used to maintain availability (that is, one which could become a master at any moment), needs to keep its cache as close as possible to the master’s, so that a failover won’t cause a blip in the workload while the cache is warming up. For the same reason, rebalance operations (and other controlled failovers) should make sure the cache is sufficiently hot before proceeding. All this to say, blob caching gives query processing incredible flexibility, including the ability to store more data than there is disk.
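The warm-up behavior above can be sketched as a popularity-aware cache. Everything here is an assumption for illustration (the class `WarmableCache`, the hit-counter popularity signal, the `warm_from` method); the point is only the shape of the idea: evict the coldest blobs when space runs out, and let a new replica prefetch another node’s hottest blobs before it takes traffic.

```python
from collections import Counter

class WarmableCache:
    """Hypothetical replica blob cache. Eviction is safe because every
    blob already lives in remote storage; popularity is tracked so a
    failover target can pre-warm to match the master."""
    def __init__(self, remote, capacity):
        self.remote = remote        # blob name -> bytes
        self.capacity = capacity
        self.local = {}             # resident blobs
        self.hits = Counter()       # popularity signal, shareable

    def read(self, name):
        self.hits[name] += 1
        if name not in self.local:
            self._admit(name, self.remote[name])
        return self.local[name]

    def _admit(self, name, data):
        # Evict the least popular resident blob when over capacity.
        if len(self.local) >= self.capacity:
            coldest = min(self.local, key=lambda n: self.hits[n])
            del self.local[coldest]
        self.local[name] = data

    def warm_from(self, other):
        # Prefetch the other node's hottest blobs (e.g. the master's)
        # before accepting traffic: a controlled-failover warm-up.
        for name, _ in other.hits.most_common(self.capacity):
            if name not in self.local:
                self._admit(name, self.remote[name])
```

A rebalance or controlled failover would call something like `warm_from` on the incoming replica and wait for it to finish before cutting over, so the workload never sees a cold cache.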
The Possibilities are Bottomless
We can now start to think about absolutely incredible capabilities. Possibilities such as:
- Data sharing: give an org within your company a consistent, read-only copy of the database which doesn’t take resources from the main cluster, and spin it down when you’re done. These copies can either stay synchronized with the main cluster or remain an unchanging snapshot of the database at a specific point in the past.
- Today we merge layers in log-structured merge trees for each table in each partition. We can turn merging into a separate service: it just reads and writes blob storage, and tells the master about the new metadata. This moves the computationally expensive parts of our architecture into dedicated services, and the service would be inherently stateless.
- For append-only workloads, there is no reason for ingest to be in the same cluster. It can just read from your existing pipeline (Kafka, S3, or whatever), upload blobs, and tell the primary cluster about the new metadata.
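The last two bullets share a pattern: a stateless worker that touches only blob storage plus a metadata endpoint. A minimal, hypothetical sketch of one ingest step (the function name, the toy CSV “encoding”, and the metadata shape are all assumptions, not the real pipeline):

```python
def ingest_batch(source_rows, blob_store, metadata_service, blob_id):
    """Hypothetical external ingest step: encode a batch into a blob,
    upload it, then register it with the cluster's metadata. The
    cluster itself never touches the raw stream."""
    # 1. Encode the batch into a (toy, CSV-like) blob.
    blob = "\n".join(",".join(map(str, row)) for row in source_rows).encode()
    # 2. Upload to blob storage.
    blob_store[blob_id] = blob
    # 3. Tell the primary cluster the blob exists; only metadata moves here.
    metadata_service.append({"blob": blob_id, "rows": len(source_rows)})
    return blob_id
```

Because the worker holds no state between batches, you can run as many of them as your Kafka topic or S3 bucket can feed, entirely outside the query-serving cluster.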
To close, I’ll note the dynamic macro environment we’re in today. It’s difficult enough to anticipate the workload and storage needs for a system or collection of systems as changing requirements drive unexpected business scenarios. With Bottomless, the user doesn’t have to worry about provisioning storage, or about managing backups and durability. And changing the cluster size is a fully online, size-of-metadata operation, so it’s easy to scale up the resources you need or share data within your organization. It just works.
What we seek to provide is even greater operational efficiency and workload isolation for system-of-record, low-latency transactions, along with support for large, complex analytics, all in the same system. In this sense, we’re building Bottomless to have synergy with our concept of Universal Storage. The benefit will be even greater flexibility to right-size your data infrastructure to the current need, while reducing the variety of data management layers in your architecture.