I have been looking at a number of different queries where 2 large distributed tables are INNER JOIN together. We put in SHARD KEY that would be consistent with the join keys used in the INNER JOIN.
Yet, looking at the query plan, the entire select statement are still being broadcasted across nodes.
Can you help me understand why or where else to look?
Both tables are also distributed pretty evenly across nodes with no significant skew.
SELECT A. COLUMNS, B.COLUMNS
FROM large_table_A A
INNER JOIN large_table_B B
ON A.SHARD_KEY_1 = B.SHARD_KEY_1
AND A.SHARD_KEY_2 = B.SHARD_KEY_2
AND A.SHARD_KEY_3 = B.SHARD_KEY_3
WHERE A.DATE BETWEEN DATE(‘2021-01-01’) AND DATE(‘2021-03-31’)
If both tables are sharded on (key_1,key_2,key_3) you should be getting a push down join.
Can we see the
show create table output for
large_table_B and the result of running
EXPLAIN on the query?
I just checked the SHARD KEYS on both tables. Looks like both tables has additional SHARD KEYS in addition to the join keys.
Both tables have shard key like (JOIN_KEY_1, JOIN_KEY_2,JOIN_KEY_3, OTHER_KEY_4, OTHER_KEY_5). Is that why it’s causing broadcast?
Yep, for shard keys to apply to a join all columns in the shard key need to be used in the join condition. Its not like a index where prefixes can use the index. Shard keys can only be used on exact matched for joins.
Unfortunately these 2 tables are used in other queries where the SHARD key are defined for. What would your recommendation here? Create a copy of these tables or should I add additional index to the existing tables?
Yes, adding a secondary index can help the join in some cases. For example it can enable the use of an index nested loop join. This is not quite as good as achieving a local join by setting the shard key of both tables to exactly match the join condition, but it is an improvement over a standard nested loop join and can be an improvement over a hash join in some cases (depending on the cardinality of those columns and the size of the tables involved).
However, you might check with your team if anyone is actually joining on all 5 of those columns. As Adam stated, you only get a performance boost from sharding that way, if your queries exactly match all of those columns. This can help joins, and it can also help group by if it is operating on the same set of columns as the join, or if there is no join. If nobody joins these 2 tables on all 5 of those columns, review all the workload that runs on these tables. Consider recreating them with a shard key that matches the most frequently used, largest, or most important join predicates.
When adding a secondary index, consider:
- On rowstore tables you can add many secondary indexes. Each of these is built with a skiplist index unless otherwise specified. They can be used in index seek (exact match on an equality predicate) and to a lesser degree for index range scan (useful for range predicates). The tradeoff is that you will have to store each additional index in memory so you want to keep the count below a dozen.
- For columnstore tables you can only define secondary indexes as hash indexes. Due to the structure of hash tables, these are only useful for exact match scenarios like equality predicates in a join or filter. There is a small hash table stored for each additional index. Since the performance of probing a hash table depends on its depth, you’ll get the best performance on low cardinality columns or highly selective filters.
When recreating a table with a different shard key, be sure to check that it will also distribute data relatively evenly between partitions. Otherwise it can result in data skew, which can put too much pressure on individual leaves, affecting overall query performance. Use the query in our docs to double check the data distribution effectiveness before creating new tables: data skew: resharding
Thank you for your suggestion. I made copies of both of the source tables under new SHARD KEY and was able to resolve the BROADCAST operation in the query plan.
I do, however, still have REPARTITION with the columns as part of the SELECT statement. What would be your recommendation to review for these? It is mostly concentrated on the large_table_A within the query.