Performance of querys joining on VARCHARs of different lengths

smaxwell · May 17, 2021, 7:46pm

While performing testing, I noticed a fairly significant difference in performance when joining two columnstore tables together where the data types were of differing lengths. In one table, the data types were VARCHAR(64) and INT. In the other, the data types were VARCHAR(200) and BIGINT.

When I changed one data type (for example, BIGINT into INT), I would get results 8% faster. If I changed both fields, the improvement increased to around 16%. I ran both versions in parallel and received similar differences in query result times.

The testing was performed in an uncontrolled environment, and there could be external factors involved, but the consistency makes me wonder if there is a benefit to matching varchar field lengths. Does different lengths of the same datatype result in reduced performance?

hanson · May 17, 2021, 10:34pm

There should not be a difference in performance comparing varchar of different lengths. It should depend on the data, not the data type.

For integers, if the types are different, a type cast would be needed, whereas if they are the same, a type cast would not be needed, which could save time. I’d recommend using the same type for integer join columns throughout your schema.

smaxwell · May 18, 2021, 1:02pm

It made no sense to me that there would be a difference in performance based on varchar length. I am going to chalk this up to an aberration. Due to the consistency of results, if I was using a more controlled environment, I would be more inclined to argue there was something else afoot.

As a follow-up, concerning the performance loss joining an INT to a BIGINT, you stated “[f]or integers, if the types are different, a type cast would be needed, whereas if they are the same, a type cast would not be needed, which could save time.” Does this mean that the cost will occur regardless of whether the conversion was implicit or explicit? Or would there be a benefit to explicitly converting the datatypes?

Due to the relative abundance of cheap storage, we are standardizing on larger data types in order to prevent both data type conversions and emergency changes due to “evolving” data. So even if there were a difference in performance due to the differences in varchar length, it will be rectified in the future.

hanson · May 19, 2021, 9:57pm

Regarding INT and BIGINT, one of our QE engineers said he didn’t think there’d be a significant difference in query performance for comparing data of these types to other values of the same type, and only a very small difference for comparing to an integer of a different type.

No, if you join an INT to a BIGINT, the casting happens internally and you don’t need to do it explicitly.

Standardizing on BIGINT for joins, makes good sense, to avoid any cast overhead and handle evolving data that may grow beyond 32 bits.

If you’re using rowstore and counting every byte to save RAM, that might argue for using INT instead, if the values fit. For columnstore, data is compressed and on disk, so space is much less of a concern.