Choosing Columnstore Key for Time-series data

zachery · October 2, 2020, 7:53pm

Hello, I wanted to know how the columnstore columns are sorted, and what is best practice for creating tables where select statements often filter by timestamps.

I have two questions:

What is best practice for defining the key() of a columnstore table, can it be a timestamp? Most of our SELECT queries utilize a timestamp, so what is the impact on performance and how do we best make use of it?
Are all columnstore columns sorted, irrespective of key?

friend · October 7, 2020, 9:20pm

Generally, the goal of a columnstore key (KEY(...) USING CLUSTERED COLUMNSTORE, also referred to as a sort key) is to enable segment elimination so that we can ensure that a minimal amount of data needs to be read to service a given query; secondary benefits include presorting data for ordered scans, merge joins, etc.

Columnstore tables are stored on disk in groups of rows referred to as ‘segments’. Inside of each segment, rows are ordered based on the table’s sort key, and while segments are initially written as data is inserted into MemSQL, we asynchronously merge them to try to ensure that segments overlap as little as possible based as determined by the sort key. With respect to your second question, the ordering of the data for each column is controlled by the overall row order determined by the sort key - more information about how storage layout and sort keys interact is available in our docs on optimizing data structures.

To return to your first question, segment elimination relies on that we store metadata with min/max values for each column per segment, and then when a query is executed we can use the filters given to eliminate segments which, based on their metadata, cannot contain the requested values. Fields that you (a) commonly query on with (b) highly selective filters are ideal for a sort key, which generally is the case for the timestamp column of a TSD table.

With respect to expanding the sort key beyond a single column, the main consideration is cardinality, as latter columns only matter when there are complete matches for the preceding columns in the sort key, which is closely related to the cardinality of each column. A common solution to this is to use a computed column that applies DATE_TRUNC to allow for taking advantage of timestamp ordering while still providing effective elimination (this is discussed further in the link above).