Query not utilizing columnstore shard key

kshastry · June 10, 2021, 4:19am

I have a columnstore table with following sample keys.

KEY (column1, column2, column3, column4) USING CLUSTERED COLUMNSTORE,
SHARD KEY (column1, column2)

Sample query
explain Select * from table1 where column1 = ‘content’ and column2 = ‘content’;
is showing ColumnStoreScan.

According to my understanding from documentation of https://docs.singlestore.com/v7.3/guides/use-memsql/query-processing/query-tuning/query-tuning/#fanout-vs-single-partition-queries
It should have been IndexRangeScan on shard key.

Is my understanding wrong ?

Also when i query with all the columns of key, still im seeing ColumnStoreScan.

glalonde · June 11, 2021, 6:13pm

@kshastry,

I see that section doc is describing how to achieve a single partition query. Single partition queries are a great way to ensure that lightweight or transactional queries only hit a single partition which stores the data they require. This is as opposed to running them on all the leaves, which benefits large queries by parallelizing them over all the host resources on all the leaves.

You are performing the correct steps to achieve a single partition query:

Create a table with a shard key.
Filter on all the columns of the shard key.

However the way you are confirming whether a single partition query was achieved is incorrect. To check whether a query plan is being executed in a single partition manner, follow the steps below:

Take an EXPLAIN or a PROFILE of the query.
Locate the GATHER operator, it is usually one of the last operators executed, near the top.
Determine whether it says partitions:all or partitions:single

Types of keys and indexes

What you checked is still a valid concern for your query, whether or not the query is making use of an index when it initially scans the data. Please note that a shard key is not an index. In SingleStore we have these types of keys:

Clustered Index
In a clustered index data is generally stored in order of the index.

Primary key on a rowstore table (the table data is ordered on the key when data is first loaded, or after running OPTIMIZE TABLE on a rowstore table, and is reordered on each leaf when that leaf is restarted and data is recovered in order into memory)
Clustered columnstore sort key on a columnstore table (each columnstore segment is ordered according to this key, and the background merger works to sort groups of segments on this key. If you run OPTIMIZE TABLE FULL on a columnstore table, the merger will instead work in the foreground and completely reorder the entire table on the sort key in each partition)

Secondary Indexes
These work like a phonebook, enabling quick lookup of particular values.

Any standard (non-primary) key on a rowstore table.
Hash indexes on either rowstore or columnstore tables.

Unique keys
These enforce a unique constraint on the column(s) specified.

Primary key on a rowstore table
Any unique key on a columnstore or rowstore table

Additionally we have shard keys. These are not an index and have no information about the order of values in the columns. Instead, the shard key is used for distributing data across partitions in the cluster.

The columns of the shard key are hashed.
The result of the hash function determines in which partition the data will be stored.
In this way, all rows that have the same values for the columns in the shard key will be stored in the same partition, because they will hash the same way. This is even true for multiple tables: if the value of their shard key is the same, those rows will be stored in the same partition. This is integral to enabling local joins instead of distributed join operations which can be more expensive.

Note that on a rowstore if you define a primary key but no shard key, the primary key is used as the shard key. If there is no primary key (it is a columnstore table, or it is a rowstore without a primary key), then the data is keylessly sharded. This means the entire row is hashed and then distributed to a partition at random. While this is often a very even way to distribute your data (depending on the true randomness of the random function), it is not good for query performance. It completely prevents any ability to perform a local join, and can prevent other optimizations that would make use of a defined shard key.

Making use of an index in a query plan

The difference between a full scan operation or an index scan operation is whether an index is useful for scanning that data. Keep in mind that whether an index is used depends on the selectivity of your query. If the full contents of your table are represented in the result, then it is more performant to simply scan the table in order rather than using the index to seek every record in the table. Generally in a relational database, secondary indexes are useful in queries with filter selectivity of about 10% or less.

Conditions with high selectivity are things like equality comparisons x = 27. Range filters tend to have less selectivity height > 4. However this depends on what your data actually represents, eg is_backordered = 1 could be half of your data, though hopefully the supply chain is better than that!

Note that on previous versions of SingleStore, index seek was not yet supported for columnstore tables, and instead we relied on features like segment elimination to reduce the initial amount of data scanned. In recent versions the columnstore sort key can be used in a seek, and we also support pushing down some operations to the encoded data, eliminating records from the result before they are even decoded from the file on disk.

To check how data is being scanned, follow the instructions below.

Take an EXPLAIN or PROFILE as before.
Locate the initial load of data from tables, either at the very bottom of the result, or the bottom of an indented section.
Look for the mention of a key or index in that first data loading operator.

For example this is a full table scan on the table orders, and the primary key is used, so it is scanned in primary key order:

memsql> explain select * from orders;
+--------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0                                                                                                                                                        |
| Project [orders.orderkey, orders.custkey, orders.orderstatus, orders.totalprice, orders.orderdate, orders.orderpriority, orders.clerk, orders.shippriority, orders.comment, orders.created] |
| TableScan memsql_demo.orders, PRIMARY KEY (orderkey) table_type:sharded_rowstore                                                                                                            |
+--------------------------------------------------------------------------------------+

However this is an IndexSeek, because I filtered on the same column as the primary key:

memsql> explain select * from orders where orderkey = 7;
+--------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------+
| Gather partitions:single alias:remote_0                                                                                                                                                     |
| Project [orders.orderkey, orders.custkey, orders.orderstatus, orders.totalprice, orders.orderdate, orders.orderpriority, orders.clerk, orders.shippriority, orders.comment, orders.created] |
| IndexSeek memsql_demo.orders, PRIMARY KEY (orderkey) scan:[orderkey = 7] table_type:sharded_rowstore                                                                                        |
+--------------------------------------------------------------------------------------+

I also have a secondary key on custkey on this table.

memsql> show create table orders;
+--------------------------------------------------------------------------------------+
| Table  | Create Table                                                                                                            |
+--------------------------------------------------------------------------------------+
| orders | CREATE TABLE `orders` (                                                                                                 |
|        |   `orderkey` bigint(20) NOT NULL AUTO_INCREMENT,                                                                        |
|        |   `custkey` int(11) DEFAULT NULL,                                                                                       |
|        |   `orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,                                        |
|        |   `totalprice` decimal(20,2) DEFAULT NULL,                                                                              |
|        |   `orderdate` date DEFAULT NULL,                                                                                        |
|        |   `orderpriority` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,                                     |
|        |   `clerk` char(15) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,                                             |
|        |   `shippriority` int(11) DEFAULT NULL,                                                                                  |
|        |   `comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,                                        |
|        |   `created` timestamp NULL DEFAULT CURRENT_TIMESTAMP,                                                                   |
|        |   PRIMARY KEY (`orderkey`),                                                                                             |
|        |   KEY `orders_fk1` (`custkey`),                                                                                         |
|        |   KEY `orders_dt_idx` (`orderdate`),                                                                                    |
|        |   KEY `orders_crtd_dt_idx` (`created`)                                                                                  |
|        | ) AUTO_INCREMENT=12203 AUTOSTATS_CARDINALITY_MODE=PERIODIC AUTOSTATS_HISTOGRAM_MODE=CREATE SQL_MODE='STRICT_ALL_TABLES' |
+--------------------------------------------------------------------------------------+

If I filter on that the same column as a secondary key/index, the plan can still make use of an IndexRangeScan, and now the key used for that scan is the key on custkey instead of the primary key.

memsql> explain select * from orders where custkey = 7;
+--------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Gather partitions:all alias:remote_0                                                                                                                                                        |
| Project [orders.orderkey, orders.custkey, orders.orderstatus, orders.totalprice, orders.orderdate, orders.orderpriority, orders.clerk, orders.shippriority, orders.comment, orders.created] |
| IndexRangeScan memsql_demo.orders, KEY orders_fk1 (custkey) scan:[custkey = 7] table_type:sharded_rowstore                                                                                  |
+--------------------------------------------------------------------------------------+

I hope this helps to explain the difference between types of indexes and keys, and how a shard key is used to enable single partition queries.

Best,
Genevieve