A month ago we found that joining on binary columns is significantly faster than joining on varchar columns: the binary data type was 2.3 times faster than varchar, and bigint was 1.5 times faster than binary and 3.5 times faster than varchar (see the earlier topic "What is best practice for CLUSTERED COLUMNSTORE keys").
Today we experienced that this is also the case when filtering data, even with full index coverage, but only on rowstore tables; with columnstore tables we saw the opposite.
Here's our example showing the performance improvement:
We have a table with keywords. Each keyword has a location (geotarget). The original table used varchar to store a two-letter ISO country code, e.g. DK:
CREATE TABLE `keywords-varchar` (
  `id` binary(16) NOT NULL,
  `location` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
);
We decided to test with location IDs consisting of four-digit numbers, e.g. 1000:
CREATE TABLE `keywords-int` (
  `id` binary(16) NOT NULL,
  `location` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
);
Results based on ~20M rows (all rows with the same location value):
call measure_q("select count(*) from `keywords-int` where location = 1000", 100);
call measure_q("select count(*) from `keywords-varchar` where location = 'DK'", 100);
Integers are 3 times faster!
The above queries are just to illustrate the performance difference. With actual real-application queries we're seeing 10x faster execution.
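For context, measure_q is just a small helper that runs a query a given number of times and reports the average execution time. Our actual procedure isn't shown here; a minimal sketch of what it could look like, assuming a MySQL-style stored procedure with dynamic SQL, is:

DELIMITER //
-- Hypothetical sketch of a measure_q-style helper (not the actual procedure):
-- prepares the query once, runs it `reps` times, and reports the average
-- wall-clock time per execution in milliseconds.
CREATE PROCEDURE measure_q(IN q TEXT, IN reps INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE t0 DECIMAL(20,6);
  SET @stmt := q;
  PREPARE ps FROM @stmt;
  SET t0 = UNIX_TIMESTAMP(NOW(6));   -- microsecond-precision start time
  WHILE i < reps DO
    EXECUTE ps;
    SET i = i + 1;
  END WHILE;
  SELECT (UNIX_TIMESTAMP(NOW(6)) - t0) * 1000 / reps AS avg_ms;
  DEALLOCATE PREPARE ps;
END //
DELIMITER ;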
Funny thing: we also tried mediumint(8) to reduce memory usage, but it doesn't perform nearly as well as int(10). Not sure if we're missing something:
CREATE TABLE `keywords-mediumint` (
  `id` binary(16) NOT NULL,
  `location` mediumint(8) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `location` (`location`)
);
call measure_q("select count(*) from `keywords-mediumint` where location = 2208", 100);
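To see whether mediumint actually saves space, one way to compare the footprint of the two tables (assuming a MySQL-compatible information_schema; the numbers are approximate and depend on the engine):

-- Rough footprint comparison between the int and mediumint variants
SELECT table_name,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
       ROUND(index_length / 1024 / 1024, 1) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name IN ('keywords-int', 'keywords-mediumint');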
Disclaimer: in this example the location column has a cardinality of one (all rows share the same value). The results might differ with higher cardinality. I'll provide an update when we migrate our production environment.
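For reference, a quick way to check the cardinality of the location column (run here against the `keywords-int` table from above):

-- How many distinct location values the test data actually contains
SELECT COUNT(DISTINCT location) AS distinct_locations,
       COUNT(*)                 AS total_rows
FROM `keywords-int`;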
However, when we tried the same with our columnstore table (50M rows), execution was 10-20 ms slower. Apparently columnstore has some varchar optimization in its compression, but based on the rowstore results I wouldn't have expected columnstore to be slower with integers.
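For completeness, a simplified sketch of what a columnstore version of the keywords table can look like; this assumes SingleStore-style clustered columnstore syntax and is not our exact production definition:

-- Hypothetical columnstore variant (SingleStore-style syntax assumed)
CREATE TABLE `keywords-columnstore` (
  `id` binary(16) NOT NULL,
  `location` int(10) unsigned DEFAULT NULL,
  KEY (`location`) USING CLUSTERED COLUMNSTORE,
  SHARD KEY (`id`)
);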