SingleStore Operator Bug Causing Partitions to be Imbalanced Across Availability Groups


We identified a bug in a recent operator version, 1.2.2-centos-93a97e50, which was published on October 1st, 2020. This issue may impact you if you run SingleStoreDB with the Kubernetes Operator.

We encourage any customers currently using this version of the operator to upgrade to version 1.2.3-centos-ef2b8561 or later. More information on the bug follows.

The bug affects clusters running across two availability groups (AGs). It can degrade performance because master partitions become unbalanced across the two availability groups.

Note that upgrading to the affected version of the operator (1.2.2-centos-93a97e50) won’t by itself trigger any immediate change to the state of the cluster. However, any subsequent maintenance activity that causes pods to be restarted, such as an engine upgrade, will lead to the imbalance across availability groups.

Bug Summary

The bug was introduced by an earlier fix in the operator that was intended to prevent unnecessary rebalance operations. That fix mistakenly caused some rebalance operations not to run after pods are restarted. As a result, master partitions become concentrated in a single availability group instead of being spread evenly across AG1 and AG2.

Note that this behavior does not cause the cluster to go offline, but can cause a performance degradation as noted above.

On an affected cluster, the output of the command SHOW PARTITIONS, run from within a database, shows master partitions concentrated in a single availability group (for example, AG1).

To verify whether your cluster is impacted, you can either run the command SHOW PARTITIONS as noted above, or you can run the command EXPLAIN REBALANCE PARTITIONS ON <db_name>. If an unbalanced state is detected, the output of this command will show a series of promote operations that would rebalance the partitions across availability groups.
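
The two checks above could be run as in the following minimal sketch. The name db_name is a placeholder, not a real database; substitute your own. EXPLAIN REBALANCE PARTITIONS only reports the planned operations and does not move any partitions.

```sql
-- db_name is a placeholder; replace it with the name of your database.

-- Check 1: list the partitions and look at which hosts hold the Master role.
-- On an affected cluster, the Master rows are concentrated on the hosts of
-- a single availability group.
SHOW PARTITIONS ON db_name;

-- Check 2: ask for the rebalance plan without executing it.
-- A series of promote operations in the output indicates an unbalanced state.
EXPLAIN REBALANCE PARTITIONS ON db_name;
```

Repeat both statements for each database in the cluster, since the check is made per database.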

Bug Mitigation

For any databases affected by this issue, run the command: REBALANCE PARTITIONS ON <db_name>.

Note that this operation requires CLUSTER permissions. It will not cause any downtime to the cluster while the rebalance operation runs.
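
As a minimal sketch, the mitigation and a follow-up check might look like the statements below. Again, db_name is a placeholder. Re-running the EXPLAIN statement afterwards is a suggested confirmation step, not something the operator requires.

```sql
-- db_name is a placeholder; replace it with the name of an affected database.
-- Requires CLUSTER permissions.
REBALANCE PARTITIONS ON db_name;

-- Optional confirmation: once the rebalance has finished, the plan should no
-- longer list promote operations for this database.
EXPLAIN REBALANCE PARTITIONS ON db_name;
```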