Shutting down one availability group?

prerak · August 25, 2021, 12:30pm

Hi,

We have a SingleStore installation with 2n leaves and 2 aggregators, configured for HA with redundancy_level=2. To reduce infra costs, I'm considering shutting off half the leaf nodes and the child (non-master) aggregator at night and on weekends, and bringing them back up on weekday mornings. For our use case, there will be no writes to the DB during this downtime (although I expect that an occasional write would not cause a problem, since the nodes should sync up when brought back). Do you think this is a feasible plan?

I liken this strategy to having a V8 engine in the car where all cylinders are firing at peak times, but when cruising on the highway, half of the cylinders are turned off to conserve fuel.

Are these the right steps?

1. Use sdb-admin stop-node to stop the leaves with availability_group = 2
2. Use sdb-admin stop-node to stop the non-master aggregator
3. Shut down servers for the nodes in step #1 and #2 at night
4. Bring up servers in the morning, and let the nodes start automatically and rejoin the cluster
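
For reference, the stop sequence in the first two steps above could be sketched roughly like this. It is a dry run that only prints the commands it would issue. The node inventory, the IDs, and the plan_night_shutdown helper are all hypothetical, and the --memsql-id and --yes flags are written from memory, so they should be checked against `sdb-admin stop-node --help` before anything is actually run.

```python
from typing import Dict, List, Optional

# Hypothetical inventory. In practice this would be filled in from
# `sdb-admin list-nodes` or SHOW LEAVES; every ID below is made up.
NODES: List[Dict[str, Optional[object]]] = [
    {"id": "AAAA000001", "role": "master_aggregator", "group": None},
    {"id": "BBBB000002", "role": "child_aggregator", "group": None},
    {"id": "CCCC000003", "role": "leaf", "group": 1},
    {"id": "DDDD000004", "role": "leaf", "group": 2},
    {"id": "EEEE000005", "role": "leaf", "group": 1},
    {"id": "FFFF000006", "role": "leaf", "group": 2},
]


def plan_night_shutdown(nodes, group=2):
    """Build the stop commands: leaves in `group` first, then child aggregators.

    The master aggregator and the leaves in the other availability group
    are never included.
    """
    leaves = [n for n in nodes if n["role"] == "leaf" and n["group"] == group]
    child_aggs = [n for n in nodes if n["role"] == "child_aggregator"]
    commands = []
    for node in leaves + child_aggs:
        # Flags are assumptions; verify with `sdb-admin stop-node --help`.
        commands.append(
            ["sdb-admin", "stop-node", "--memsql-id", str(node["id"]), "--yes"]
        )
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in plan_night_shutdown(NODES):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to confirm that only availability group 2 and the child aggregator are touched before wiring this into a scheduler.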

Would you recommend any additional steps? Or is the sequence in the SingleStore documentation page "Taking Leaves Offline without Cluster Downtime" the right one to follow?

Should we be turning off leaf failure detection on the master for this duration, and if so, what is the best way to do this?

Thank you for any thoughts or recommendations.

-Prerak

hanson · August 25, 2021, 10:15pm

It sounds like it will work, theoretically, but you will lose read HA during the slack period. If another node goes down, you won’t be able to run queries that touch partitions on that node. So I would not recommend going this route if you want any read HA.
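
To make the read-HA point concrete, here is a toy model of it. It is only a sketch: the leaf names and partition layout are made up, and it assumes each partition has exactly two copies, one on a leaf in each availability group, which is how I would expect redundancy_level=2 to behave.

```python
def available_partitions(placement, down_leaves):
    """Return the partitions that still have a copy on a running leaf."""
    return sorted(
        partition
        for partition, leaves in placement.items()
        if any(leaf not in down_leaves for leaf in leaves)
    )


# Hypothetical layout: leaf1 and leaf3 are in group 1, leaf2 and leaf4 in group 2.
# Each partition maps to (group 1 copy, group 2 copy).
PLACEMENT = {
    0: ("leaf1", "leaf2"),
    1: ("leaf1", "leaf2"),
    2: ("leaf3", "leaf4"),
    3: ("leaf3", "leaf4"),
}
GROUP_2 = {"leaf2", "leaf4"}

# Daytime, one leaf fails: its pair still serves every partition.
day_with_failure = available_partitions(PLACEMENT, {"leaf3"})

# Night, group 2 stopped on purpose: everything is still readable.
night = available_partitions(PLACEMENT, GROUP_2)

# Night, and leaf3 also fails: partitions 2 and 3 have no running copy.
night_with_failure = available_partitions(PLACEMENT, GROUP_2 | {"leaf3"})

print(day_with_failure, night, night_with_failure)
```

With the full cluster up, losing leaf3 costs nothing for reads. With group 2 already stopped, the same failure leaves queries that touch partitions 2 and 3 unable to run.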

prerak · August 26, 2021, 12:25pm

Understood, thanks, Eric. Yes, we would clearly not have any sort of HA once we shut down half of the DB. I’ll try it out and report back if I run into any issues.

hanson · August 26, 2021, 10:32pm

Removing nodes and rebalancing, then adding nodes back and rebalancing, is a lot easier than what you propose, but it does move a lot of data around, so it might not be suitable for you.

Also, to do what you suggest, you would have to make sure you can get the same storage volumes and IP addresses back when you bring the shut-down nodes back up.