COLUMNSTORE indexes in SingleStore 7.5

jlb · July 25, 2021, 12:10pm

Hi
I am trying to understand COLUMNSTORE indexes in SingleStore 7.5.

Let’s say I have a table as follows:
(The table has >200 columns but these are the important ones)

CREATE TABLE t(
varchar_1 NOT NULL
varchar_2 NOT NULL
varchar_3 NOT NULL
double1  NOT NULL
date_1 DEFAULT NULL
varchar_4 NOT NULL
varchar_5 NOT NULL
varchar_6 NOT NULL
bigint_1 DEFAULT 0
KEY (varchar_3) USING CLUSTERED COLUMNSTORE
SHARD KEY(varchar_3)
UNIQUE KEY(varchar_1, varchar_2, varchar_3) USING HASH
KEY (date_1) USING HASH
KEY (varchar_4) USING HASH
KEY (varchar_5) USING HASH
KEY (varchar_6) USING HASH
KEY (bigint_1) USING HASH
);

I understand the limitations regarding keys and indexes in version 7.5 are as follows:

I can still only have a single column as a shard key.
I would have liked to have the following:
KEY (varchar_1, varchar_2, varchar_3) USING CLUSTERED COLUMNSTORE
HASH KEYS can only be used in equality comparisons.
Will I therefore be able to run queries with date_1 ranges(between date x and date y)?
This is required.

My notes:

Varchar_3 is the critically important value in the table on which everything is linked. It is nearly unique. there may be say 1, 3, 5, 10 or at most 20 identical values depending on the size of the dataset. Qualified by varchar_1 and varchar_2 it becomes unique. It is not the ideal CLUSTERED COLUMNSTORE KEY(see above) I have no ideal CLUSTERED COLUMNSTORE KEY.
Using varchar_1, varchar_2 and varchar_3(or say, just varchar_2 and varchar_3, or just varchar_3) I will need sums of double_1, GROUPED BY ranges in date_1(see above) and using varchar_5, varchar_6 etc. These will be needed in many queries which include other columns in the table, as will varchar_4, varchar_5 and varchar_6.
I will often be querying using varchar_4, varchar_5 and varchar_6 usually individually but sometimes together.
I will often need a query using varchar_6 combined with bigint_1.
I will often need a sum of double_1.

I noticed in an example table in the Docs or a Blog that a UNIQUE field in an example table was made the SHARD KEY, the CLUSTERED COLUMNSTORE KEY and a HASH KEY.
There must have been a reason for that.
That reason lies at the centre of my question today.

Last question: How much disk space/memory space do HASH indexes use?
Would it be practical to have say 20 HASH indexes for possible planned queries?

Many thanks.

hanson · July 27, 2021, 12:22am

Re: 1., you can have a multi-column shard key. And you can have a multi-column sort key too.

Re: 2., hash keys can be used in any kind of comparison, but the hash indexes only will be used to execute equality lookups, or varieties thereof, such as IN filters. Other types of filters will resort to a table scan, which may be optimized by other means, like segment elimination on a datetime column used as a sort key.

Using a unique field for a shard key can be a good idea because it will result in virtually no skew in partition size (skew is generally bad). But also using it as the sort key would be unusual because you want to save your sort key for things that need fast range queries, usually.

Using a HASH key on the SHARD KEY column could make good sense if you will do equality filters on that key.

I don’t have data for you on how much space hash indexes take. I’ll ask one of the developers to chip in.

Using 20 hash indexes might make sense, but remember that every index you add will mean more overhead for inserts and deletes and some updates. 20 sounds like a lot for an update-heavy workload. But if it is read-mostly, it could be a good idea.

Physical database design (choosing your keys and indexes) is an optimization problem, and the optimal schema depends heavily on your data, queries, and workload. That’s why my answers are not definite. Focus on the top few queries in your workload, ranked by frequency times cost, when making your design decisions for keys and indexes and table type.

jlb · July 28, 2021, 1:45pm

Thank you @hanson for the tips and guidelines.

jlb · July 31, 2021, 9:42am

Hi again.
@hanson

Using my example above, would there be an advantage to use the following modified KEY design:
(My original explanation was incorrect, (varchar_2, varchar_3) combined are UNIQUE!)

KEY (varchar_2, varchar_3) USING CLUSTERED COLUMNSTORE
SHARD KEY(varchar_2, varchar_3)
UNIQUE KEY(varchar_2, varchar_3) USING HASH
KEY (varchar_4) USING HASH
KEY (varchar_5) USING HASH
KEY (varchar_6) USING HASH
KEY (bigint_1) USING HASH

The following questions apply:

Will the SHARD KEY sort (varchar_2, varchar_3). I would assume YES.
Is there still an advantage in making (varchar_2, varchar_3) the CLUSTERED COLUMNSTORE KEY?
Is there any advantage to declaring (varchar_2, varchar_3) as a UNIQUE KEY as described above?

Many thanks.

hanson · August 2, 2021, 6:22pm

It will sort on (varchar_2, varchar_3) because of the KEY… USING CLUSTERED COLUMNSTORE clause.

Usually people would not make a unique shard key their columnstore sort key. The sort key would be reserved for something else, like a datetime column used in range filters.

I would only make the (varchar_2, varchar_3) key unique if you want to enforce uniqueness, normally. There is some benefit to making it unique because the query optimizer will know that filters that compare both of those columns to a constant will always produce at most 1 row, so cost estimates could be better, and some other kinds of optimizations may be possible that rely on only one row matching a key.

jlb · August 3, 2021, 11:06am

Many thanks @hanson.
I am slowly getting the idea.