Best columnstore sharding/indexing strategy for a billion-row table

Hello!

We have been using the SingleStore managed service for a year now with great success. We want to start making heavier use of the platform, but our team is running into difficulties defining the schema of a billion-row table so that it can support different types of queries with sub-second performance.

The table would contain daily-aggregated data of millions of users. It would have the following columns:

user_id VARCHAR
day DATE
agg_field_1
agg_field_2
...
agg_field_50

Every day we have data for a few million users (from 1 to 5 million), with around 50 different aggregate values, which are integers, booleans and short strings. The cardinality of the user_id column in an entire year would be around 50 million.

We have 10 years of historical data (around 10 billion rows), and the table should support growing for another 10 years.

The complexity lies in the different types of queries that we need to perform (example SQL after the list):

  1. For a given day:

    • Aggregate the data of all the 1-5M users of that day, performing SUM, MAX, AVG and other aggregations on the agg_fields listed above.

    • List the top users of that day, sorted by one of the agg_fields.

  2. For a given list of user_ids (from 1 to 500 of them) and a range of days:

    • Aggregate the data for each user, performing SUM, MAX, AVG and other aggregations on the agg_fields listed above.
    • Get all the agg_fields for all the days and all the users.
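
For concreteness, the four query shapes look roughly like this (assuming the table is named data; the dates, user IDs and limits are placeholders):

-- 1a: aggregate over all users of a day
SELECT SUM(agg_field_1), MAX(agg_field_2), AVG(agg_field_3)
FROM data
WHERE day = '2023-05-01';

-- 1b: top users of a day by one agg_field
SELECT user_id, agg_field_1
FROM data
WHERE day = '2023-05-01'
ORDER BY agg_field_1 DESC
LIMIT 100;

-- 2a: per-user aggregates for a list of users over a range of days
SELECT user_id, SUM(agg_field_1), MAX(agg_field_2)
FROM data
WHERE user_id IN ('u1', 'u2', 'u3')
  AND day BETWEEN '2023-01-01' AND '2023-03-31'
GROUP BY user_id;

-- 2b: all agg_fields for those users and days
SELECT *
FROM data
WHERE user_id IN ('u1', 'u2', 'u3')
  AND day BETWEEN '2023-01-01' AND '2023-03-31';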

If we use day as the columnstore key and shard key, we only get good performance on the queries that filter by day; if we use user_id as the columnstore key and shard key, we only get good performance on the queries that filter by user_id. We haven’t been able to get good performance on all our queries with a single choice of keys.

Is the only solution to have two tables containing the same data but with different Columnstore/Shard keys, or is there a better solution?

Thanks a lot!

Did you try using a columnstore key of (day, user_id)?

Yes. In that case, the queries that filter by day have great performance, but any query that filters only by user_id takes several seconds to finish.

Hi Keneck! This is a fantastic problem and I am sure we can find a good solution. I am going to break down my thinking on this issue, so apologies if you already know (and have tried) most of what I am about to say.

Before we start, I am using the following schema in my examples.

create table data (
    user_id TEXT,
    day DATE,
    field_1 INT,
    field_2 INT,
    field_3 INT
);

First, for the shard key - I am not certain you need one at all. The main reason to use a shard key here is for join and group-by operations, and it doesn’t sound like you are doing those. The downside of picking a shard key is that it can skew how much CPU you are throwing at each query. Since most of your queries are large scans, the more evenly you can balance the CPU load between partitions the better. So my suggestion here is to define SHARD KEY () on your table (keyless sharding) to get balanced execution.

Next, for the sort key. Per day you have data for 1-5 million users, and since your columns are aggregates, I assume you have one row per user per day. So, if we set the sort key to day, then on each partition we can immediately scan the set of rows corresponding to a single day or a range of days. This means that for your second query set, each partition can immediately project the range of days we care about and then filter down to the set of users. So, as @zmeidav suggested, a sort key of (day, user_id) seems best. In your original question you didn’t say that you have queries which filter only by user_id. In the event you have those as well, I suggest adding a hash key on user_id, which should help filter the segments down to only the relevant rows.

Please see how the following schema works for your use case:

create table data (
    user_id TEXT,
    day DATE,
    field_1 INT,
    field_2 INT,
    field_3 INT,

    shard key (),
    sort key (day, user_id),
    key (user_id) using hash
);

You might also consider playing with the columnstore segment size to improve the selectivity of your user_id-only queries. You can set it on just this table using this syntax:

SORT KEY (day, user_id) with (columnstore_segment_rows=200000),
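
In context, the full definition would be something like (same schema as above, with the 200k segment size):

create table data (
    user_id TEXT,
    day DATE,
    field_1 INT,
    field_2 INT,
    field_3 INT,

    shard key (),
    sort key (day, user_id) with (columnstore_segment_rows=200000),
    key (user_id) using hash
);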

Make sure you are testing performance one change at a time and using Studio’s visual explain tool to help understand what SingleStore is doing under the hood.
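
If you prefer SQL to Studio, running EXPLAIN on a representative query is a quick way to check whether the sort key and hash key are being used to eliminate segments, e.g.:

EXPLAIN
SELECT user_id, SUM(field_1)
FROM data
WHERE user_id IN ('u1', 'u2')
  AND day BETWEEN '2022-01-01' AND '2022-03-31'
GROUP BY user_id;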

If you are still having issues after these tests, please reach out and we will set up a call to figure out next steps. Cheers!


Hey @carl , thanks for your response!

Your suggestions were very helpful. We ended up creating the tables below, in case they are helpful to anyone else.

create table t1 (
    user_id TEXT,
    day DATE,
    field_1 INT,
    field_2 INT,
    field_3 INT,

    shard key (user_id),
    sort key (day, field_1 DESC),
    unique key (user_id, day) using hash
);
create table t2 (
    user_id TEXT,
    day DATE,
    field_1 INT,
    field_2 INT,
    field_3 INT,

    shard key (day),
    sort key (user_id, day),
    unique key (user_id, day) using hash
);

A few observations:

  1. In the first table the sort key is (day, field_1 DESC) because the majority of the queries obtain the top users of a given day sorted by field_1 DESC.
  2. We created table t2 for the queries that filter by user_id or by user_id+day (example queries for both tables below).
  3. While you were right that we don’t need a shard key, we had to include one because we wanted to add a unique key on (user_id, day), and a unique key must contain the shard key columns so that each partition can enforce uniqueness locally. We chose user_id in t1 and day in t2 so that the data gets distributed evenly and all the CPU cores can work in parallel on every query. I assume sharding by (user_id, day) in both tables would have worked as well.
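
To make the split concrete, these are roughly the query shapes each table serves (dates, IDs and limits are placeholders):

-- t1: top users of a given day, served by sort key (day, field_1 DESC)
SELECT user_id, field_1
FROM t1
WHERE day = '2023-05-01'
ORDER BY field_1 DESC
LIMIT 100;

-- t2: per-user history over a range of days, served by sort key (user_id, day)
SELECT user_id, day, field_1, field_2
FROM t2
WHERE user_id IN ('u1', 'u2')
  AND day BETWEEN '2023-01-01' AND '2023-03-31';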

@carl Let me know if that makes sense or if there’s anything that could still be improved.
We’ll be loading a few billion rows in the next few days, so hopefully this architecture will be optimal.

Thanks!

Hey @Keneck! Sorry for the delayed response. It’s been a week since then - curious if you ended up loading the data and checking performance? It would be great to figure out a single schema that works for everything - but this is a good step in the right direction.

Regarding observation 3 - sharding both tables by (user_id, day) might be a bit better, since your queries should fan out well across more query shapes.
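
That is, in both table definitions the shard clause would become:

    shard key (user_id, day),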

Excited to hear how this goes!