Yes, adding a secondary index can help the join in some cases. For example it can enable the use of an index nested loop join. This is not quite as good as achieving a local join by setting the shard key of both tables to exactly match the join condition, but it is an improvement over a standard nested loop join and can be an improvement over a hash join in some cases (depending on the cardinality of those columns and the size of the tables involved).
However, you might check with your team if anyone is actually joining on all 5 of those columns. As Adam stated, you only get a performance boost from sharding that way, if your queries exactly match all of those columns. This can help joins, and it can also help group by if it is operating on the same set of columns as the join, or if there is no join. If nobody joins these 2 tables on all 5 of those columns, review all the workload that runs on these tables. Consider recreating them with a shard key that matches the most frequently used, largest, or most important join predicates.
When adding a secondary index, consider:
- On rowstore tables you can add many secondary indexes. Each of these is built with a skiplist index unless otherwise specified. They can be used in index seek (exact match on an equality predicate) and to a lesser degree for index range scan (useful for range predicates). The tradeoff is that you will have to store each additional index in memory so you want to keep the count below a dozen.
- For columnstore tables you can only define secondary indexes as hash indexes. Due to the structure of hash tables, these are only useful for exact match scenarios like equality predicates in a join or filter. There is a small hash table stored for each additional index. Since the performance of probing a hash table depends on its depth, you’ll get the best performance on low cardinality columns or highly selective filters.
When recreating a table with a different shard key, be sure to check that it will also distribute data relatively evenly between partitions. Otherwise it can result in data skew, which can put too much pressure on individual leaves, affecting overall query performance. Use the query in our docs to double check the data distribution effectiveness before creating new tables: data skew: resharding