SingleStore’s Patented Universal Storage

SingleStore Universal Storage technology is a dramatic evolution of our columnstore data format that allows it to support both operational and analytical workloads with low total cost of ownership (TCO). People love our rowstores for OLTP-style access patterns because of our great performance based on memory-optimized data structures and compilation of queries to machine code. But for users with large data sets, the cost of providing servers with enough RAM to hold all the data started to be significant. Universal Storage solves this cost problem, while also providing even better analytical performance via query execution that takes advantage of columnar storage formats, vectorized execution, and single-instruction multiple-data (SIMD) instructions.

This is the third in a series of four articles that describe SingleStore's unique, patented Universal Storage feature. Read Part 1, Part 2 and Part 4 in the series to learn the whole story.

Our 7.0 release introduced universal storage and the 7.1 release enhanced it. Now, we’re delivering Episode 3 of universal storage in our 7.3 release. In this installment, we’ve added:

Columnstore can be made the default table type by setting a new engine variable default_table_type to ‘columnstore’
UPSERT support for columnstore tables with single-column unique keys, including
- INSERT ON DUPLICATE KEY UPDATE
- INSERT IGNORE
- REPLACE
- Analogous capabilities for Pipelines and LOAD DATA
Support for unique key enforcement for very large INSERTs and UPDATEs
- Now, arbitrary INSERTs and UPDATEs, regardless of the number of rows updated, will succeed. In 7.1, if more than, say, 20-30 million rows were updated in a single statement on a small node, the operation could fail, due to an out-of-memory condition.

Please see this video to learn more about our universal storage update in SingleStore 7.3:

Columnstore as Default

In Singlestore Helios, the default table type will be set to columnstore for new clusters when 7.3 is deployed. In SingleStoreDB Self-Managed, the default table type is rowstore, but can be set to columnstore by running:

set global default_table_type = 'columnstore';

Now, just about any CREATE TABLE statement you might have run to create an application on MySQL, MariaDB, or for SingleStore rowstore tables will execute properly and create a columnstore. The biggest benefit of this for new users is that they won’t run out of RAM when they add a lot of data. Veteran users who use columnstore most of the time will appreciate the convenience of it.

As an example, the following CREATE TABLE statement will run and create a columnstore in this mode:

create table fact_sales(ts datetime(6), qty int, prod_id int,
 unit_price decimal(18,2), store_id int);

This statement is equivalent to the following, which is output by show create table fact_sales:

CREATE TABLE `fact_sales` ( `ts` datetime(6) DEFAULT NULL,
`qty` int(11) DEFAULT NULL,
`prod_id` int(11) DEFAULT NULL,
`unit_price` decimal(18,2) DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
KEY `__UNORDERED` () USING CLUSTERED COLUMNSTORE , SHARD KEY () );

Notice that the default shard key is empty, resulting in round-robin data distribution across leaf nodes. Also, the default columnstore sort key is empty, which means the system doesn’t actively maintain the data in sorted order. For many use cases, and for developers just getting started, these choices are fine.

You can easily specify a columnstore sort key with the new shorthand SORT KEY that’s equivalent to KEY(…) USING CLUSTERED COLUMNSTORE.

In addition, KEY(…) designations in a CREATE TABLE statement automatically cause creation of hash keys. See our 7.3 documentation for full details.

If you’re in columnstore-as-default mode and you want a rowstore, you can just say

CREATE ROWSTORE TABLE tableName(...)

Bottom line, if you’re new to SingleStore and you don’t want to worry about running out of RAM, you have a mostly-analytical workload, or both, set default_table_type = ‘columnstore’. And, on our managed service, you’ll get this by default.

Upserts on Universal Storage

When we designed the roadmap for universal storage, we saw that many of our users were using rowstores to accept new data, detect duplicates, and do upserts. Then after the data settled down, they’d move it to a columnstore. And then they’d have to do some special query logic to query the colder columnstore part and the hotter rowstore part together to combine the data in both of them.

Universal storage eliminates the need to do this. And 7.3 delivers a major part of this: upsert capability columnstores.

The most typical kind of upsert does a conditional insert if a key is not present, and otherwise updates the existing record with that key somehow. E.g. the raw input data might have “network events” which record

device_id — a unique identifier of a network device

bytes_transmitted — bytes transmitted for this even

ts — the time the transmission even occurred

Perhaps this is in the following table:

create table network_events_new(
  device_id_n int,
  bytes_transmitted_n int,
  ts_n datetime(6),
  sort key(ts_n),
  shard key (device_id_n)
);

Remember, this is a columnstore; the SORT KEY makes it so.

Now, suppose our ultimate goal is to keep a running total of bytes transferred by a device, as well as the latest time of an event affecting that device, in this table:

create table network_events_summary(
  device_id int,
  bytes_transmitted int,
  ts datetime(6),
  unique key(device_id) using hash,
  sort key(ts),
  shard key (device_id)
);

Now, load some initial data into the “new” networks events table:

load data infile "/data/network_events.csv"
into table network_events_new
fields terminated by ','
lines terminated by '\n';

Where the .csv file contains:

1,1024,2020-12-09 14:22:06.765384
2,4096,2020-12-09 14:22:43.613131
2,2048,2020-12-09 14:23:06.786447
1,512,2020-12-09 14:23:27.964939
3,128,2020-12-09 15:55:48.209948

Suppose the initial data in network_events_summary is created as follows:

insert into network_events_summary values(1,256,"2020-12-09 12:57:46.244642");

So it contains just:

+-----------+-------------------+----------------------------+
| device_id | bytes_transmitted | ts                         |
+-----------+-------------------+----------------------------+
|         1 |               256 | 2020-12-09 12:57:46.244642 |
+-----------+-------------------+----------------------------+

Now, we can do an upsert to tally up total bytes transmitted, plus add rows for new devices seen for the first time:

insert into network_events_summary (device_id, bytes_transmitted, ts)
select * from network_events_new
on duplicate key update
bytes_transmitted = bytes_transmitted + values(bytes_transmitted),
ts = values(ts);

Now, the summary will have all three devices and their totals and last-updated time, as follows:

sdb> select * from network_events_summary order by device_id;
+-----------+-------------------+----------------------------+
| device_id | bytes_transmitted | ts                         |
+-----------+-------------------+----------------------------+
|         1 |              1792 | 2020-12-09 14:23:27.964939 |

|         2 |              6144 | 2020-12-09 14:23:06.786447 |

|         3 |               128 | 2020-12-09 15:55:48.209948 |
+-----------+-------------------+----------------------------+

This kind of upsert pattern previously worked on SingleStore rowstore tables with a unique key. What’s new is that it now works on columnstore tables with single-column unique keys.

UPSERT-style logic features (by which I mean REPLACE, IGNORE, and SKIP DUPLICATE KEY) of LOAD DATA and pipelines also now can be used with columnstores, also with the single-column unique key restriction. Universal storage is getting very near functionally complete, and can already cover most use cases. Look for the full functionality of universal storage to be finished, with introduction of multi-column key support, in Episode 4!

Support for Large INSERTs/UPDATEs With Unique Keys

As a simple example of a large batch INSERT with a unique key, consider this script:

drop database if exists db1;
create database db1;
use db1;

create table t(a int not null, unique key(a) using hash, shard(a), key(a) using clustered columnstore);

insert into t values(1);

delimiter //
/* Fill table t with n or more rows, doubling the size until
   the goal is reached. */
create procedure inflate(n bigint) as
declare
  c bigint;
begin
  select count(*) from t into c;
  while (c < n) loop
    insert into t 
    select a + (select max(a) from t)
    from t;

  select count(*) from t into c;
  end loop;
end //
delimiter ;

In a VM on a laptop with 8GB RAM, we get these results:

call inflate(32*1000*1000); /* success on 7.1.13 and 7.3.1 */

call inflate(64*1000*1000); /* OOM on 7.1.13, success on 7.3.1 */

That is, when inflating the table to over 67 million rows, it fails on 7.1.13 with an out-of-memory error, but it succeeds on 7.3.1.

The algorithms for enforcing unique keys have changed so they will work even if the data is larger than will fit in RAM. This is helpful in a number of real situations, such as when you are copying a large table with an in INSERT INTO … SELECT FROM … operation and the new target table has a unique key on it. In 7.1, you might have had to break the data into chunks and run several commands to do that. Now, you can do it in one step.

Summary

Universal storage now is maturing and has almost all the functionality needed to support applications that use to require rowstores. This is a big TCO improvement. Many shops may be able to set the default table type to columnstore now given the functional surface area supported, and this will be the default on Singlestore Helios. Finally, support for unique key enforcement regardless of data size improves reliability and simplifies development.

Read Part 1, Part 2 and Part 4 in the series to learn the whole story behind SingleStore's patented Universal Storage feature.