Separation of storage and compute is an important capability of most cloud-native databases. Being able to scale compute independently from storage allows for cost savings as well as improved performance and elasticity. For example, operational databases such as Amazon Aurora and Azure Hyperscale have introduced separate log and storage services that make the log durable and take on some log maintenance work that would typically be done by the database engine itself. The result is lower network and disk utilization while still providing high availability and durability of data. Cloud data warehouses (DWs) such as Snowflake and Amazon Redshift write data out to blob storage, keeping only hot data cached on local disks for use by queries. This allows databases to grow larger than local disk and lets compute be quickly added or removed by pulling data from blob storage.
Both of these classes of databases (operational and analytical) make trade-offs that optimize for the specific use cases they target. For example, cloud DWs force data to be written to blob storage before a transaction is considered committed. This means a small write transaction has higher latency on a cloud DW than on a traditional SQL database, because blob store writes have higher and less predictable latency than local disk writes. This trade-off is reasonable for a cloud DW designed for large batch loading via large write transactions. Cloud operational databases don’t use blob storage at all, which limits the maximum possible size of a database, increases the cost of storing rarely queried data or backups, and limits the throughput of data access for processing or for scale-out. They are designed for smaller writes and reads with more stringent availability and durability requirements (multiple copies of data stored across multiple availability zones and regions). For a cloud operational database, it’s rare to have very large data sizes (hundreds of terabytes) and also rare to have infrequently queried data, so avoiding blob storage altogether is a trade-off that makes sense for them.
SingleStore’s (S2) separation of storage and compute design, codenamed bottomless, was built to have many of the advantages of both cloud operational and cloud DW databases. S2 is able to commit write transactions without doing any blob store writes while at the same time using blob storage to support databases larger than local storage, speed up scale-out, and improve durability. S2 can use the data in blob storage for backup and point-in-time recovery. Blob storage at rest is cheap enough that S2 can retain weeks or months of history. The remainder of this article describes the design for bottomless storage in more detail to give you a feel for the different design trade-offs it makes.
How it works
To dig into how bottomless works, we need some background on how S2 organizes data on disk. S2 supports both rowstore and columnstore storage, and each storage type has its own on-disk layout. The rowstore is in-memory only (data must fit into memory). Its writes are persisted to disk in a transaction log. The log is checkpointed by writing out a full snapshot of the in-memory state to disk and truncating log data that existed before the snapshot. By default, S2 retains two previous snapshots as a safeguard against data corruption, as shown in the diagram below. Rowstore data is replicated via log shipping to replicas for high availability.
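To make the log-plus-snapshot persistence concrete, here is a minimal Python sketch of that checkpoint cycle: a commit is acknowledged only after its log append is made durable, and a checkpoint writes the full in-memory state to a snapshot file named by LSN so older log data can be truncated. The class, file names, and JSON encoding are illustrative assumptions, not S2’s actual implementation.

```python
import json
import os


class RowstoreLog:
    """Toy model of an in-memory rowstore persisted by a transaction log plus
    periodic snapshots. Names and layout are illustrative, not S2's real code."""

    SNAPSHOTS_TO_RETAIN = 2  # previous snapshots kept as a safeguard against corruption

    def __init__(self, directory):
        self.directory = directory
        self.state = {}          # the in-memory table: key -> value
        self.lsn = 0             # log sequence number of the latest committed write
        self.snapshot_lsns = []  # LSNs of snapshots still on disk
        self.log = open(os.path.join(directory, "txlog"), "a")

    def write(self, key, value):
        # Commit path: append to the log and make it durable before acknowledging.
        self.lsn += 1
        self.log.write(json.dumps({"lsn": self.lsn, "key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Write a full snapshot of the in-memory state, named by the current LSN,
        # then retire everything older than the most recent previous snapshots.
        path = os.path.join(self.directory, f"snapshot_{self.lsn}")
        with open(path, "w") as f:
            json.dump({"lsn": self.lsn, "state": self.state}, f)
        self.snapshot_lsns.append(self.lsn)
        keep = self.SNAPSHOTS_TO_RETAIN + 1  # current snapshot plus two previous ones
        retired, self.snapshot_lsns = self.snapshot_lsns[:-keep], self.snapshot_lsns[-keep:]
        for old in retired:
            os.remove(os.path.join(self.directory, f"snapshot_{old}"))
        # A real engine would also truncate log entries older than the oldest
        # retained snapshot at this point.
```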
Columnstore data is persisted in a different manner. It’s written out to disk column by column, with each column stored in its own data file with a best-fit compression scheme applied. A set of column data files for 1 million compressed rows is called a segment, as shown in the diagram below. Each segment has metadata, such as the minimum and maximum values in each column and a bitmap of deleted rows in the segment, stored in-memory in a rowstore internal table. Each data file is named based on the log sequence number (LSN) of the rowstore log at which its metadata is committed. You can think of the columnstore files as logically existing in the rowstore log stream, though they physically exist as separate files on disk (not appended into the physical log). As in a traditional columnstore, the files are immutable. Updates mark rows as deleted in metadata and write the updated rows out to new columnstore segment files. When a segment has enough deleted rows, it is asynchronously merged into other segment files. The segment files are sorted into a log-structured merge (LSM) tree, with an in-memory rowstore layer at the initial level to absorb small writes. The LSM tree allows for efficient ordered access while reducing the cost of keeping the data sorted as it is updated. The details of how the LSM tree functions are not important for understanding the separation of storage and compute design, so they are not covered in this article. For more details, check out this talk by Joseph Victor.
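The per-segment bookkeeping can be pictured as a small data structure. The sketch below is a rough model, not S2’s internal schema: the field names (`commit_lsn`, `column_ranges`, `deleted_rows`) and the pruning helper are hypothetical, and the helper just illustrates one common use of per-column min/max metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ColumnFile:
    # One compressed column of a segment; the file is named by the LSN at which
    # the segment's metadata commit landed in the rowstore log.
    path: str       # e.g. "columns/<lsn>_<column>.bin" (illustrative naming)
    encoding: str   # best-fit compression chosen per column


@dataclass
class SegmentMetadata:
    # Kept in an in-memory rowstore internal table, not inside the data files.
    commit_lsn: int
    row_count: int                                          # up to 1M rows per segment
    column_ranges: Dict[str, Tuple[object, object]]         # column -> (min, max)
    deleted_rows: List[bool] = field(default_factory=list)  # bitmap; updates mark rows here
    files: List[ColumnFile] = field(default_factory=list)   # immutable once written


def can_skip(segment: SegmentMetadata, column: str, lo, hi) -> bool:
    """Skip a segment whose min/max range for `column` cannot overlap [lo, hi]."""
    cmin, cmax = segment.column_ranges[column]
    return cmax < lo or cmin > hi
```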
An important observation about S2’s storage design is that all data files are immutable once written, with the exception of the tail of the currently active log file. This immutability is the property that makes persisting data in blob storage natural for S2. Blob stores don’t typically support APIs for updating files, and even if they did, the round trips needed for small updates would result in poor performance.
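A hedged sketch of that idea is shown below: commits wait only on local durability, while sealed, immutable files are handed to a background worker that copies them to blob storage as-is. The `upload_to_blob` callback stands in for an S3 or Azure Blob PUT and is an assumption for illustration, not an S2 API.

```python
import queue
import threading


class BlobUploader:
    """Background uploader for sealed (immutable) files. Because the files never
    change after they are written, they can be copied to blob storage whole, with
    no need for update APIs or read-modify-write round trips."""

    def __init__(self, upload_to_blob):
        self.upload_to_blob = upload_to_blob   # hypothetical blob-store PUT callback
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def file_sealed(self, path):
        # Called when a log file is rotated or a columnstore segment is written.
        # The commit itself has already been acknowledged off local durability;
        # the upload happens asynchronously and never blocks the commit path.
        self.pending.put(path)

    def _drain(self):
        while True:
            path = self.pending.get()
            self.upload_to_blob(path)   # a real system would retry on failure
            self.pending.task_done()
```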
S2’s separation of storage and compute design can be summarized as follows:
The advantages (and disadvantages) of separation
This gives bottomless the following nice properties:
This design does have some disadvantages we should point out:
As a result, bottomless allows for improved elasticity at lower cost. If a workload is busier during particular hours of the day, more compute can quickly be brought to bear during that window. If a workload is idle overnight, compute can be paused and then resumed in the morning. By storing history in the blob store, costly user errors (e.g., accidentally dropping a table) can be recovered from quickly with very little extra storage cost. Bottomless clusters no longer need expensive block storage (EBS) for durability, which also drives down costs and improves performance for disk-heavy workloads.
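As a rough illustration of how retained history could drive point-in-time recovery, the sketch below picks the newest snapshot at or before a target LSN plus the log files needed to roll forward to it. The object layout and helper are assumptions made for illustration, not S2’s actual recovery procedure.

```python
def restore_to_lsn(blob_objects, target_lsn):
    """Choose the files needed to rebuild state as of `target_lsn`.

    `blob_objects` is a list of (kind, lsn, name) tuples describing objects in
    blob storage, where kind is "snapshot" or "log". This layout is an
    illustrative assumption, not S2's actual object naming scheme."""
    snapshot_lsns = [lsn for kind, lsn, _ in blob_objects if kind == "snapshot"]
    base = max((lsn for lsn in snapshot_lsns if lsn <= target_lsn), default=None)
    if base is None:
        raise ValueError("no snapshot at or before the requested point in time")
    log_files = [
        name
        for kind, lsn, name in sorted(blob_objects, key=lambda obj: obj[1])
        if kind == "log" and base < lsn <= target_lsn
    ]
    return base, log_files  # load snapshot `base`, then replay these logs in order
```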
In summary, S2’s bottomless design allows it to scale out faster, improve durability and backup capabilities, and use faster local disks for cached data. The design relies on S2’s storage layout based on immutable files and on its high-performance replication and durability code. Bottomless avoids some of the downsides of more specialized designs, specifically the slower transaction commits that would otherwise make the database less usable for data-intensive applications.