Solution To avoid data skew and gather partition all by adding a constant field with values 0 to 99

ravi.vignesh89 · June 13, 2022, 7:48am

Hi Team,

I have the data below represented with over 170 distinct key values which I am supposed to use a shard key column. The counts below are representation of the number of records for each key.

ColumnStore Table. The table will be sourced for analytic reports

Table1

KeyColumn Count(1)

12345 1000

67890 300

98765 5000

76543 2500

43789 100000

I am calculating one more column with the formula (row_number() over (partition by keycolumn) % 100). So that every 100 rows within each key will have a value ranging from 0 to 99 as an addition column ShardColumn.

Now I have given shardkey as (KeyColumn,ShardColumn)

And in the select where clause I am using the below where clause.
where KeyColumn = 12345 and ShardColumn in (0,1,2,3,4,5,6,7,8…99)

Now, when I did the explain and profile the gather partition is InList and also the skew percentage has reduced from 80 to 11.

I need suggestion or the comment on the above approach whether it is an optimized table design in terms of aggregation and the indexing also by avoiding the data skew.

The explain and profile has said it. But I need a direct suggestion from your team would be more helpful and confident.

MariaSilverhardt · July 1, 2022, 2:36am

Hi Ravi!

We hope this suggestion is helpful.

When picking a shard key, we want something with high cardinality to minimize data skew. Please read our best practices for selecting a shard key.For a particular table, you can test theoretical shard keys by running the below command - read more here. The goal is to have an even distribution of records across partitions.

SELECT WITH(leaf_pushdown=true) SUM(c) rows, PARTITION_ID() FROM (SELECT count(*) c FROM <table_name> GROUP BY <potential_shard_key>) reshuffle GROUP BY PARTITION_ID();

We recommend running this query to check for data skew before making shard key changes. If your data is evenly distributed across partitions, and all of your keys are variables that are often queried/filtered on, then it is a good selection.

Please keep on posted on how it goes!

Table1
KeyColumn	Count(1)
12345	1000
67890	300
98765	5000
76543	2500
43789	100000