SingleStore Kubernetes Operator Upgrade Bug

Update on 1/15/2021

We have released a fix for the bug described below. We recommend that users upgrade to the newer operator release before performing any SingleStore upgrades.

Release Information
Version: 1.2.4
Images can be found here
Release Notes
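As a rough pre-upgrade guard, the check below compares an operator version string against the fixed release, 1.2.4. It is only a sketch: the function name is hypothetical, it assumes the operator version is a plain dotted numeric string (optionally prefixed with "v"), and it assumes releases after 1.2.4 keep the fix. How you obtain the version string from your deployment is left out, since that depends on your setup.

```python
# Hypothetical helper, not part of the operator or any SingleStore tooling.
FIXED_OPERATOR_VERSION = (1, 2, 4)  # release named in this post as containing the fix


def operator_has_fix(operator_version: str) -> bool:
    """Return True if the given operator version is 1.2.4 or newer.

    Assumes a dotted numeric version such as "1.2.4" or "v1.2.4".
    """
    cleaned = operator_version.strip().lstrip("v")
    parts = tuple(int(piece) for piece in cleaned.split("."))
    # Tuples compare element by element, so (1, 10, 0) > (1, 2, 4).
    return parts >= FIXED_OPERATOR_VERSION


if __name__ == "__main__":
    for version in ("1.2.3", "1.2.4"):
        print(version, "has fix:", operator_has_fix(version))
```

Comparing integer tuples rather than raw strings avoids the common mistake of ordering "1.10.0" before "1.2.4".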


Introduction
We identified a critical bug in the Kubernetes operator that manifests when upgrading to a major version of SingleStore. If you do not deploy your SingleStore DB instance with a self-managed Kubernetes Operator, this bug does not affect you.

The issue exists in all versions of the SingleStore Kubernetes operator and affects upgrades to any major version of SingleStore. Patch upgrades (e.g., dot releases such as 7.1.8 to 7.1.9) are not affected.
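To make the distinction concrete, here is a minimal sketch of how one might classify a planned upgrade. The function names are hypothetical, and the rule is an assumption: it treats an upgrade as a patch upgrade only when the first two version components are unchanged and the third increases, as in the 7.1.8 to 7.1.9 example. Anything else is conservatively treated as potentially affected.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string such as "7.1.8" into integers."""
    return tuple(int(part) for part in version.strip().split("."))


def is_patch_upgrade(current: str, target: str) -> bool:
    """Return True only for upgrades like 7.1.8 -> 7.1.9.

    Assumption: versions have the form X.Y.Z, and a change in X or Y
    means the upgrade is not a patch upgrade.
    """
    cur, tgt = parse_version(current), parse_version(target)
    if len(cur) != 3 or len(tgt) != 3:
        raise ValueError("expected versions of the form X.Y.Z")
    return cur[:2] == tgt[:2] and tgt[2] > cur[2]


if __name__ == "__main__":
    print(is_patch_upgrade("7.1.8", "7.1.9"))  # patch upgrade from the post
    print(is_patch_upgrade("7.1.9", "7.3.1"))  # first two components change
```

If the check returns False, the safer reading of this post is to hold the upgrade until the operator itself has been updated.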

We recommend that users do not upgrade to a major version of SingleStore with the operator until a new version with the fix has been released.

Impact
The issue affects the availability of your redundant partitions (e.g., Availability Group 2) and therefore poses a risk to high availability if a failover occurs. It also impacts performance, since partitions may not be adequately balanced across the cluster. Note that this bug does not affect the availability of your primary master partitions; these can still be accessed and your data will remain available.

Symptoms
The bug occurs during an upgrade and causes a stalled recovery of a redundant partition on a particular node. Specifically, the node containers for the secondary redundant partitions will begin crashing and then restarting in sequence.

The root cause is the incorrect timing of a rebalance operation during the upgrade process.

Mitigation
We will update this post with mitigation steps as soon as possible, as well as with information about a new release that fixes this issue. For now, we recommend that users do not upgrade to a major version of SingleStore with the operator until a new version with the fix has been released.

Please open a support ticket or reply to this post if you have any questions or concerns about this bug.