Upgrade to 7.5.6 destroyed the cluster

Hi,

I just upgraded my cluster from 7.3.5 to 7.5.6, and the upgrade destroyed it.

The precheck passed, and the upgrade went fine on the first leaf, but it failed to detach the second leaf.

Now the second leaf can no longer start; it fails with this error:

05970401 2021-07-29 10:35:27.748 FATAL: Thread 115118: jumpToUpgradeStep: This node is not managed by a supported tool. Please use a toolbox version at least as new as 1.11.3.
: Failed to connect to MemSQL: process exited: exit status 1

The toolbox version installed is 1.11.9

How can I solve this, please? My cluster is in a weird state where the nodes no longer have the same version.

Thanks for your help

PS: all nodes throw the same error after a server reboot, and the cluster can’t start anymore.

UPDATE: I found out that the file /var/lib/memsql/XXXX/memsql.cnf contained an outdated toolbox_version = xx property, which was responsible for the upgrade failure.
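For anyone hitting the same error: a minimal sketch of locating and removing a stale toolbox_version entry. The directory name and version value below are made up for illustration; on a real node the file lives under /var/lib/memsql/, and I would back it up before editing.

```shell
# Illustrative only: build a sample config resembling
# /var/lib/memsql/<node-directory>/memsql.cnf (names/values are made up).
tmpdir=$(mktemp -d)
cat > "$tmpdir/memsql.cnf" <<'EOF'
[server]
port = 3306
toolbox_version = 1.9.0
EOF

# Show any stale toolbox_version entries, then delete them so the
# upgrade tooling re-detects the installed Toolbox version.
grep -n 'toolbox_version' "$tmpdir/memsql.cnf"
sed -i '/^toolbox_version[[:space:]]*=/d' "$tmpdir/memsql.cnf"

# Confirm the property is gone.
grep -q 'toolbox_version' "$tmpdir/memsql.cnf" || echo "stale toolbox_version removed"
```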

The cluster is back online, but the leaf that failed to upgrade is using 100% CPU and all the partitions are impacted.

The tracelogs on the failed leaf are streaming millions of errors like:

ERROR: Replication Management Thread Worker (the_db_name): Thread 112458: ProcessSingleMaster: Failed to process slaves for master database the_db_name (async non-fatal failure)

In Studio, most of the partitions are marked as “Impacted”; what does that mean?

What would you recommend to do please?

In your text, you said you upgraded to 7.3.6, but the title says 7.5.6. I think you mean 7.5.6; please clarify. Sorry to hear about your trouble. I will ask one of the developers to take a look.

Yes, I upgraded to the latest version, 7.5.6.

Hello pierre,
We looked at the toolbox_version issue and found where the problem is coming from. We are currently working on a fix for it in Tools. Thanks for reporting the issue.
Regarding your second issue, it might be because not all nodes are upgraded. Please check to make sure all nodes are on the new version.
Also, we are curious to know why the detach failed in your upgrade in the first place. It would be great if you could send us a cluster report at bug-report@memsql.com.
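For anyone checking node versions after an upgrade: `sdb-admin list-nodes` prints each node’s version, and a quick way to confirm they all match is to extract and de-duplicate the version strings. The table below is a simplified, made-up sample; on a real cluster you would pipe the command’s output instead of the heredoc.

```shell
# Extract every x.y.z version string from (sample) list-nodes output and
# de-duplicate; exactly one unique value means all nodes match.
versions=$(cat <<'EOF' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | sort -u
| Role   | Host        | Version |
| Master | node-master | 7.5.6   |
| Leaf   | node-leaf-a | 7.5.6   |
| Leaf   | node-leaf-b | 7.5.6   |
EOF
)
if [ "$(echo "$versions" | wc -l)" -eq 1 ]; then
  echo "all nodes on version: $versions"
else
  echo "version mismatch: $versions"
fi
```

On a live cluster, the first line would instead be `versions=$(sdb-admin list-nodes | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | sort -u)`.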

Thanks @mojdeh for working on this,

How can I finish the upgrade, please? The nodes now report the latest version after I rebooted them, but the partitions on the second leaf are clearly messed up… CPU is still at 100% and the whole cluster is pretty much frozen.

Thanks

Here is the upgrade log:

NEW DISCOVERY:

I created a fresh 7.5.6 cluster and started to restore my backups.
On the leaf, tracelogs/memsql.log is also streaming millions of errors like this:

2015587609 2021-07-29 22:48:34.954 ERROR: Replication Management Thread Worker (my_db_name_0): Thread 115063: ProcessSingleMaster: Failed to process slaves for master database my_db_name (async non-fatal failure)

Feels like it’s a bug in the new release…

CLUSTER REPORT:
Cluster report generation is frozen on both clusters (the upgraded one and the fresh one), so I can’t get it.

Are you a paying customer, by the way? If so, I recommend you open a support ticket for this.

Not yet, just an early adopter 🙂

The fresh cluster was set up on Debian 10, on 3 Google Cloud VMs with 8 cores each (1 MA + 2 leaves with HA) and all the recommended OS optimisations (THP disabled, etc.).

Even though I’m not yet paying for a license, this issue seems too serious to be left to community support… imagine the impact of such a topic on new customers reading the forum.

EDIT: the cluster report completed after a few hours; I sent it to bug-report@memsql.com.

Hi pierre,

Thanks for the report.

Is this report from the upgraded cluster or the fresh cluster?

-Adam

It is from the upgraded cluster.

I had a similar issue using Toolbox 1.11.9. I cannot upgrade from 7.3.2 to either 7.5.6 or 7.3.14.

The last error was: error running command: /usr/bin/scp -oBatchMode=yes -q -p -oControlPath=/tmp/singlestoredb-toolbox739529638/2.socket /tmp/singlestoredb-server-deb101326029 memsql-leaf-0:/tmp/tmp.LXK4JSkHnl: exit status 1

The same issue happens after downgrading toolbox to 1.11.3.

I recovered the cluster by uninstalling the failed version and restarting the cluster with sdb-admin start-node --all.

Resolved: the issue was caused by low disk space in the /tmp directory. Everything went smoothly after uninstalling old SingleStore versions.
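Since the upgrade stages the new server package under /tmp via scp (as the error above shows), a pre-upgrade free-space check along these lines could catch this earlier. The 2 GB threshold is my own guess, not an official requirement.

```shell
# Check free space on the filesystem backing /tmp before upgrading.
# Threshold of ~2 GB is an assumption, not a documented requirement.
need_kb=$((2 * 1024 * 1024))                      # ~2 GB expressed in KB
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')  # available KB on /tmp
if [ "$avail_kb" -ge "$need_kb" ]; then
  echo "/tmp has enough space (${avail_kb} KB free)"
else
  echo "WARNING: only ${avail_kb} KB free in /tmp; clean up before upgrading"
fi
```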

The scp command works with the -i flag using the identity file specified in the cluster config.

Setting IdentityFile in ~/.ssh/config doesn’t solve the issue.

Thanks @mpskovvang for sharing your experience with this version.

The SingleStore team is actually investigating my cluster report, and the bug seems to come from an SSL layer update in 7.5 that causes trouble between leaves… that would also explain why a fresh cluster had a similar issue.

Will post the solution as soon as we find it.

As recommended by the SingleStore team, disabling the intra-cluster SSL config fixed my issue on 7.5.

Yes, there is an SSL performance regression in 7.5.6. It’s small enough not to be noticeable if SSL is configured only between the aggregators and the application (the typical configuration), but if SSL is enabled intra-cluster between the leaves, the slowdown is pretty drastic. Disabling intra-cluster SSL resolved things for Pierre. We have a fix in testing now, and it will be released in the first 7.5 patch in a week or so.

-Adam

Hi @adam, do you know if the patch has been released?
Thanks

Hi Pierre,

The SSL fix was released in 7.5.7 early last week. The release notes mention the SSL performance fix.

-Adam
