I’m creating quick reference guidelines for key selection based on the MemSQL documentation for my team. Here is what I have for columnstore keys.
- A columnstore table has one columnstore key, up to one shard key, and no other keys including primary keys.
- The columnstore key should consist of the fields on which there are always, or most commonly, filters ordered from lowest to highest cardinality (from least to most unique).
- Queries that join columnstore tables on columnstore key fields perform better with less memory overhead.
- In general, the columnstore key should not be more precise than common query filters. For example, it may be better to include a date rather timestamp field followed by another commonly filtered field to leverage both.
I have a few questions regarding these guidelines.
- If a query does not filter or join on the first field in a columnstore key, are there any circumstances in which the remaining key fields have any benefit?
- When should shard key fields be or not be included in the columnstore key? If a table is always joined on a shard key, should it always be included in the columnstore key? If not, why does a common filter have different benefits from a common join?
- If common filter fields are not yet known, are the shard and/or (undefined) primary key fields generally the best default option? (I think so because we can at least expect them to be used in joins.)
- The documentation states that columnstore tables joined on columnstore key fields perform better with less memory overhead. Is this only true for joins on the full columnstore key? If it can be a subset, does it have to be the first field(s) similar to question 1?
- The documentation focuses on queries dictating key selection but I think that it may be a good guideline to suggest that key selection can dictate queries. One key cannot accommodate all queries, so we must guide users to query in a way that leverages the keys effectively. In some cases, a table’s data distribution may dictate a specific key and thereby specific kinds of queries to perform effectively. Does that sound reasonable?