A month ago we found that joining on integer / binary columns is significantly faster than char(32):
The binary data type was 2.3 times faster than char. The bigint data type was 1.5 times faster than binary and 3.5 times faster than char.
What is best practice for CLUSTERED COLUMNSTORE keys - #6 by hanson - Query Tuning - SingleStore Forums
Today, we found that this also holds when filtering data, even with full index coverage - but only on rowstore tables; with columnstore tables we saw the opposite.
I’ll give our example to show the performance improvement:
We have a table with keywords. Each keyword has a location (geotarget). The original table used varchar to store an ISO country code of two letters, e.g. DK:
CREATE TABLE `keywords-varchar` (
`id` binary(16) NOT NULL,
`location` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
We decided to test with location IDs consisting of four-digit numbers, e.g. 1000:
CREATE TABLE `keywords-int` (
`id` binary(16) NOT NULL,
`location` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
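A backfill from the varchar table into the int table might look like the sketch below; the `locations` lookup table and its columns are hypothetical (we only assume some mapping from ISO codes to numeric IDs exists):

-- Hypothetical lookup table mapping two-letter ISO codes to numeric IDs
CREATE TABLE `locations` (
  `id` int(10) unsigned NOT NULL,
  `iso_code` char(2) NOT NULL,
  PRIMARY KEY (`id`)
);

-- Copy rows over, translating each ISO code to its numeric ID
INSERT INTO `keywords-int` (`id`, `location`)
SELECT kv.`id`, loc.`id`
FROM `keywords-varchar` kv
JOIN `locations` loc ON loc.`iso_code` = kv.`location`;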
We used a stored procedure to measure execution time while ignoring client communication (inspired by @hanson's blog post).
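We won't reproduce his procedure here, but a minimal sketch of the idea in SingleStore's procedural SQL could look like the following. This is our reconstruction, not the original: the body and the reported units are assumptions, and it assumes the measured query returns a single BIGINT column, as count(*) does.

DELIMITER //
CREATE OR REPLACE PROCEDURE measure_q(q TEXT, iterations INT) AS
DECLARE
  -- Parse the query string once; assumes a single BIGINT result column
  qry QUERY(c BIGINT) = TO_QUERY(q);
  res ARRAY(RECORD(c BIGINT));
  started DATETIME(6);
  i INT = 0;
BEGIN
  started = NOW(6);
  WHILE i < iterations LOOP
    -- Run the query and discard the result
    res = COLLECT(qry);
    i = i + 1;
  END LOOP;
  -- Report total elapsed time (microseconds here; the original may differ)
  ECHO SELECT TIMESTAMPDIFF(MICROSECOND, started, NOW(6)) AS elapsed_us;
END //
DELIMITER ;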
Results based on ~20M rows (all with the same location value):
call measure_q("select count(*) from `keywords-int` where location = 1000", 100);
155,289
call measure_q("select count(*) from `keywords-varchar` where location = 'DK'", 100);
482,675
Integers are ~3 times faster (482,675 / 155,289 ≈ 3.1)!
The above queries are just to illustrate the performance difference. With actual real-application queries we're seeing 10x faster execution.
Funny thing: we tried mediumint(8) as well to reduce memory usage, but it performs nowhere near as well as int(10). Not sure if we're missing something:
CREATE TABLE `keywords-mediumint` (
`id` binary(16) NOT NULL,
`location` mediumint(8) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `location` (`location`)
)
call measure_q("select count(*) from `keywords-mediumint` where location = 2208", 100);
389,273
Disclaimer: in this example the location column held a single distinct value (cardinality of 1). The results might differ at higher cardinality. I'll provide an update when we migrate our production environment.
However, we tried the same with our columnstore table (50M rows) and saw execution times 10-20 ms slower. Apparently columnstore must have some varchar optimization in its compression, but based on the rowstore results I wouldn't have expected columnstore to be slower with integers.
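For reference, a columnstore variant of the test table would be declared with a clustered columnstore key, along these lines (a sketch - our actual production table has more columns):

CREATE TABLE `keywords-columnstore` (
  `id` binary(16) NOT NULL,
  `location` int(10) unsigned DEFAULT NULL,
  KEY (`location`) USING CLUSTERED COLUMNSTORE
);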