Separation of storage and compute is an important capability of most cloud native databases. Being able to scale compute independently from storage allows for cost savings as well as improved performance and elasticity. For example, operational databases such as Amazon Aurora and Azure Hyperscale have introduced separate log and storage services that do the work of making the log durable as well as some log maintenance work that would typically be done by the database engine itself. The result is lower network and disk utilization to provide high availability and durability of data. Cloud data warehouses (DWs) such as Snowflake and Amazon Redshift write data out to blob storage, keeping only hot data cached on local disks to be used by queries. This allows for databases larger than local disk and for compute to be quickly added or removed by pulling data from blob storage.
Both of these classes of databases (operational and analytical) make trade-offs that optimize for the specific use cases they target. For example, the cloud DWs force data to be written to blob storage before a transaction is considered committed. This means a small write transaction has higher latency on a cloud DW than on a traditional SQL database because blob store writes have higher and less predictable latency than local disk writes. This trade-off is reasonable for a cloud DW that is designed for large batch loading via large write transactions. Cloud operational databases don’t use blob storage at all, which limits the maximum possible size of a database, increases the cost of storing rarely queried data or backups, and limits the throughput of data access for processing or for scale out. They are designed for smaller writes and reads with more stringent availability and durability requirements (multiple copies of data stored across multiple availability zones and regions). For a cloud operational database, it’s rare to have really large data sizes (hundreds of terabytes) and also rare to have infrequently queried data, so avoiding blob storage altogether is a trade-off that makes sense for them.
SingleStore’s (S2) separation of storage and compute design, codenamed bottomless, was built to have many of the advantages of both cloud operational and cloud DW databases. S2 is able to commit write transactions without doing any blob store writes, while at the same time using blob storage to support databases larger than local storage, speed up scale out, and improve durability. S2 can use the data in blob storage for backup and point-in-time recovery. Blob storage at rest is so cheap that S2 can store weeks or months of history inexpensively. The remainder of this article describes the design for bottomless storage in more detail to give you a feeling for the different design trade-offs it makes.

How it works

In order to dig into how bottomless works, we need some background on how S2 organizes data on disk. S2 supports both rowstore and columnstore storage. Both of these storage types have their own on-disk layouts. The rowstore is in-memory only (data must fit into memory). Its writes are persisted to disk in a transaction log. The log is checkpointed by writing out a full snapshot of the in-memory state to disk and truncating log data that existed before the snapshot. By default, S2 retains two previous snapshots as a safeguard against data corruption as shown in the diagram below. Rowstore data is replicated via log shipping to replicas for high availability.
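The checkpointing scheme described above can be sketched in a few lines. This is a toy model, not SingleStore’s actual implementation: the class and method names are hypothetical, and "disk" is simulated with in-memory lists. It shows the essential invariant: a write is durable once appended to the log, and a checkpoint writes a full snapshot and then truncates log records already covered by the oldest retained snapshot.

```python
class RowstoreLog:
    """Toy model of rowstore durability: an append-only log that is
    periodically checkpointed by snapshotting the in-memory state.
    All names here are hypothetical, not S2's real internals."""

    SNAPSHOTS_TO_RETAIN = 2  # default safeguard against corruption

    def __init__(self):
        self.lsn = 0            # log sequence number of the next record
        self.log = []           # (lsn, record) pairs persisted on "disk"
        self.memory_state = {}  # the in-memory rowstore
        self.snapshots = []     # (snapshot_lsn, copy of state)

    def commit(self, key, value):
        # A write is durable once appended to the log (and replicated).
        self.log.append((self.lsn, (key, value)))
        self.memory_state[key] = value
        self.lsn += 1

    def checkpoint(self):
        # Write a full snapshot of the in-memory state, keep the last N
        # snapshots, then truncate log records older than the oldest one.
        self.snapshots.append((self.lsn, dict(self.memory_state)))
        self.snapshots = self.snapshots[-self.SNAPSHOTS_TO_RETAIN:]
        oldest_lsn = self.snapshots[0][0]
        self.log = [(lsn, rec) for lsn, rec in self.log if lsn >= oldest_lsn]
```

Recovery in this model is the same replay S2 performs: load the latest snapshot, then apply any log records at or after its LSN.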
Columnstore data is persisted in a different manner. It’s written out to disk column by column, with each column stored in a data file on disk with a best-fit compression scheme applied. A set of column data files for 1 million compressed rows is called a segment, as shown in the diagram below. Each segment has metadata such as the minimum and maximum values in each column, as well as a bitmap of deleted rows in the segment, stored in-memory in a rowstore internal table. Each data file is named based on the log sequence number (LSN) of the rowstore log at which its metadata is committed. You can think of the columnstore files as logically existing in the rowstore log stream, though they physically exist as separate files on disk (not appended into the physical log). As in a traditional columnstore, the files are immutable. Updates mark rows as deleted in metadata and write out updated rows in new columnstore segment files. When a segment has enough deleted rows, it is merged into other segment files asynchronously. The segment files are sorted into a log structured merge (LSM) tree with an in-memory rowstore layer at the initial level to absorb any small writes. The LSM tree allows for efficient ordered access while reducing the cost of keeping the data sorted as it is updated. The details of how the LSM tree functions are not important for understanding the separation of storage and compute design, so they are not covered in this article. For more details, check out this talk by Joseph Victor.
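The per-segment metadata is worth a small illustration, because it is what lets queries avoid touching data files at all. The sketch below is hypothetical (field and method names are invented for illustration): it models a segment’s min/max metadata plus the delete bitmap, and shows segment elimination, where an equality predicate outside the segment’s [min, max] range means its data files never need to be opened.

```python
from dataclasses import dataclass, field

@dataclass
class SegmentMetadata:
    """Hypothetical sketch of per-segment columnstore metadata: min/max
    values per column and a bitmap of deleted rows, kept in an in-memory
    rowstore internal table rather than in the data files themselves."""
    data_files: dict   # column name -> data file path (named after commit LSN)
    column_min: dict   # column name -> minimum value in this segment
    column_max: dict   # column name -> maximum value in this segment
    deleted_rows: set = field(default_factory=set)  # row offsets marked deleted

    def can_skip(self, column, value):
        # Segment elimination: if an equality predicate falls outside the
        # [min, max] range, the segment's data files need not be read.
        return not (self.column_min[column] <= value <= self.column_max[column])

# A query for user_id = 9000 never opens this segment's files on disk.
seg = SegmentMetadata(
    data_files={"user_id": "23617.column_user_id"},  # hypothetical file name
    column_min={"user_id": 1},
    column_max={"user_id": 5000},
)
assert seg.can_skip("user_id", 9000)
```

Because deletes only touch the in-memory metadata (the bitmap), the immutable data files on disk never change after they are written, which is the property the next section builds on.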
An important observation about S2’s storage design is that all data files are immutable once written other than the tail of the currently active log file. This immutability is an important property that makes persisting data in blob storage natural for S2. Blob stores don’t typically support APIs for updating files, and even if they did, the round trips to do small updates would result in poor performance. 
S2’s separation of storage and compute design can be summarized as follows:
  • Transactions are committed to the tail of the log on local disk and replicated to other nodes for durability just as when S2 runs without blob storage available. The tail of the log is always stored on local disk. There are no blob store writes required to commit a transaction.
  • Newly committed columnstore files are pushed to blob storage as quickly as possible asynchronously after being committed. Hot data is kept cached locally on disk for use by queries and cold data is removed from local disk. This allows for fast local disks to be used for storing cached columnstore data.
  • Transaction logs are uploaded to blob storage in chunks below an LSN known to contain only committed data. The uncommitted tail of the log is never uploaded to the blob store.
  • Snapshots of rowstore data are only taken on primary databases and pushed to blob storage. Replicas don’t need to do their own snapshotting, saving on disk IO. If a replica ever needs a snapshot (say, because it was disconnected for a long period of time), it can get it from blob storage.
  • To add more compute to the cluster, new replica databases get the snapshots and logs they need from blob storage and the uncommitted log tail not yet in blob storage is replicated from the primary. Columnstore files are pulled from the blob store and cached on demand as needed by queries.
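The upload rules above can be condensed into a short sketch. This is a simplified model under stated assumptions: `blob_put` stands in for a real blob store client call (e.g., an S3 PutObject wrapper), and the class and method names are invented for illustration. The key points it captures are that blob writes happen off the commit path on a background worker, and that a log chunk is only eligible for upload when its highest LSN is at or below the committed LSN, so the uncommitted tail never leaves local disk.

```python
import queue
import threading

class BlobUploader:
    """Sketch of the asynchronous upload path: committed columnstore files
    and sealed log chunks below the committed LSN are pushed to blob
    storage in the background. Names are hypothetical."""

    def __init__(self, blob_put):
        self.blob_put = blob_put          # stand-in for a blob store client
        self.pending = queue.Ueue() if False else queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def on_segment_committed(self, path):
        # The transaction already committed via the local log + replication;
        # the blob write adds no latency to the commit itself.
        self.pending.put(path)

    def on_log_chunk_sealed(self, chunk_path, max_lsn, committed_lsn):
        # Only upload chunks known to contain nothing but committed data.
        # The uncommitted log tail stays on local disk.
        if max_lsn <= committed_lsn:
            self.pending.put(chunk_path)

    def _drain(self):
        while True:
            path = self.pending.get()
            self.blob_put(path)
            self.pending.task_done()
```

With this shape, query latency and commit latency depend only on local disk and replication; blob storage bandwidth only bounds how quickly cold data and history become durable in the blob store.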

The advantages (and disadvantages) of separation

This gives bottomless the following nice properties:
  • Small write transactions have no extra latency compared to S2 running without blob storage (no writes to blob storage are needed to commit a transaction).
  • New replicas and new hosts can be spun up and added to the cluster by pulling data from blob storage, which has higher bandwidth than pulling data from primary databases on other nodes in the cluster. The result is less network and disk IO used by existing nodes in the cluster to add a new node.
  • Replicas can now start acknowledging committed transactions without needing to have all columnstore data replicated. The replica just needs the snapshot and logs. The columnstore data (the bulk of data stored in the database) is pulled from the blob store on demand as needed by queries and cached.
  • Local disks and compute instances can be sized separately from data storage requirements. For example, S2 can now use fast SSDs for local disk sized to accommodate the working set of the application. Using more expensive, and much slower, block storage for local disks (e.g., Amazon EBS) is no longer needed as the blob store is acting as continuous backup. This choice does come with some trade-offs discussed in the disadvantages section below.
  • The blob store acts as an extra layer of durability. S2 can keep months of history due to how cheap it is to store data at rest in blob storage. It can do point in time recovery to previous points in time based on the stored history. The blob store is acting like a continuous backup.
  • Quickly stopping all compute is fast and easy. The stop operation only needs to wait until any data that only exists on local disk is moved to the blob store (tail of the log and any recently written columnstore files).
This design does have some disadvantages we should point out:
  • Durability and availability are not separated in S2 today. S2’s log replication is responsible for doing both. This means that loss of all local disks in the cluster (all replicas) impacts durability. This is mitigated by storing multiple copies of the data in the cluster in different availability zones, so only a full region outage could trigger loss of any data that is not yet in blob storage.
    • One interesting side note is that relying on S2 replication on ephemeral disks avoids the duplication of work that occurs when other cloud databases put data on block storage (e.g., EBS) for durability and then also enable their own replication at the database layer for high availability, essentially replicating the data twice (once by EBS and again by the database’s replication protocol).
  • S2 can’t rely on blob storage cross region replication for cross region high availability as some cloud DWs do. Recall that recently committed data is not always stored in blob storage, so loss of a region will mean loss of some data on local disk. This is similar to the downside of using asynchronous replication for cross region HA. So, on loss of a region the data can be recovered to the latest consistent state in blob storage, but that state may not have all committed transactions. 
Taken together, bottomless allows for improved elasticity at lower costs. If a workload is busier during particular hours of the day, more compute can be quickly brought to bear during that time frame. If a workload is idle overnight, compute can be quickly paused and resumed in the morning. By storing history in the blob store, costly user errors (e.g., accidentally dropping a table) can be recovered from quickly with very little extra storage cost. Bottomless clusters no longer need expensive block storage (EBS) for durability, which also drives down costs and improves performance for disk-heavy workloads.
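Point-in-time recovery falls out of the snapshot-plus-log history kept in blob storage. The sketch below is a deliberately simplified model (the function and its inputs are hypothetical, not an S2 API): pick the newest snapshot at or before the target LSN, then replay log records from that snapshot’s LSN through the target.

```python
def point_in_time_recover(snapshots, log_records, target_lsn):
    """Hypothetical PITR sketch. `snapshots` maps snapshot_lsn -> state
    dict, where a snapshot at LSN N reflects all records with lsn < N.
    `log_records` is a list of (lsn, key, value). Recovers the state as
    of target_lsn, inclusive."""
    # Newest snapshot that does not run past the recovery target.
    base_lsn = max(lsn for lsn in snapshots if lsn <= target_lsn)
    state = dict(snapshots[base_lsn])
    # Replay only the log records the snapshot does not already cover.
    for lsn, key, value in log_records:
        if base_lsn <= lsn <= target_lsn:
            state[key] = value
    return state
```

Because blob storage at rest is cheap, retaining the snapshots and log chunks needed for this replay over weeks or months costs little, which is what makes the "continuous backup" behavior described above practical.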

Conclusion

In summary, S2’s bottomless design allows it to scale out faster, have improved durability and backup capabilities, and use faster local disks for cached data. The design relies on S2’s storage layout based on immutable files and on its high performance replication and durability code. Bottomless avoids some of the downsides of more specialized designs, specifically slower transaction commits that would otherwise make the database less usable for data intensive applications.