Node goes randomly offline

novalis83 · September 5, 2021, 7:26pm

Hello,
My customer says that one of the nodes in the cluster goes offline at random, roughly twice a week. He is right: I can see the event logged in the Events section of MemSQL Studio.
What can I check to understand why this happens?
Thank you.

adam · September 7, 2021, 2:44am

Hi,

One way is to check the master aggregator's tracelog (the tracelogs/memsql.log file). It has more detailed information about the failover. Look for traces similar to:

144740819 2019-10-30 15:28:33.883   INFO: ProcessTransactions Node 10.0.3.171:3306 heartbeat failure summary. Initial heartbeat failure at 2019-10-30 15:28:32. 210 Consecutively Missed heartbeats. Failover was triggered after 200 missed heartbeats
144740830 2019-10-30 15:28:33.883   INFO: ProcessTransactions Heartbeat connection attempts summary:
144740837 2019-10-30 15:28:33.883   INFO: ProcessTransactions   Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:32
144740844 2019-10-30 15:28:33.883   INFO: ProcessTransactions   Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:33
144740851 2019-10-30 15:28:33.884   INFO: ProcessTransactions   Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:33
144740858 2019-10-30 15:28:33.884   INFO: ProcessTransactions   Node 10.0.3.171:3306 heartbeat is currently attempting to reconnect
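
If the tracelog is large, a small script can pull these lines out for you. The following is a minimal Python sketch, assuming the log lines are worded like the sample above (other versions may phrase them differently). The function name and the regular expressions are my own and are not part of any MemSQL tooling.

```python
import re

# Patterns modeled on the sample tracelog lines above. The exact wording
# is an assumption and may differ between versions.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
SUMMARY_RE = re.compile(
    r"Node\s+(?P<node>\S+)\s+heartbeat failure summary\.\s+"
    r"Initial heartbeat failure at (?P<first>" + TS + r")\.\s+"
    r"(?P<missed>\d+) Consecutively Missed heartbeats"
)
ATTEMPT_RE = re.compile(
    r"Node\s+(?P<node>\S+)\s+encountered a heartbeat connection failure\s+"
    r"\((?P<code>\d+):(?P<reason>[^)]*)\)\s+at (?P<at>" + TS + r")"
)


def find_heartbeat_failures(lines):
    """Return (summaries, attempts) found in an iterable of tracelog lines."""
    summaries = []
    attempts = []
    for line in lines:
        m = SUMMARY_RE.search(line)
        if m:
            summaries.append({
                "node": m.group("node"),
                "first_failure": m.group("first"),
                "missed": int(m.group("missed")),
            })
            continue
        m = ATTEMPT_RE.search(line)
        if m:
            attempts.append({
                "node": m.group("node"),
                "code": int(m.group("code")),
                "reason": m.group("reason"),
                "at": m.group("at"),
            })
    return summaries, attempts
```

You could call it as find_heartbeat_failures(open(path)) with the path to the master aggregator's tracelog. The error code and reason in each attempt (for example 111:Connection refused) are a first hint about whether the process was down or the host was unreachable.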

Then check the failed leaf's own tracelog to see what it was doing at the time of the failover. Did it encounter a problem? For example: was the host healthy, did it run out of disk, did the memsqld process crash?
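
To narrow down the leaf's tracelog, you can keep only the lines logged close to the failure time reported by the master aggregator. This is a rough Python sketch, assuming each leaf tracelog line carries a timestamp in the same YYYY-MM-DD HH:MM:SS form as the sample above. The function name and the 60-second default window are hypothetical choices of mine.

```python
import re
from datetime import datetime, timedelta

# Assumes the timestamp format seen in the sample tracelog lines.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def lines_near(lines, center, seconds=60):
    """Return the lines whose first timestamp is within +/- seconds of center."""
    lo = center - timedelta(seconds=seconds)
    hi = center + timedelta(seconds=seconds)
    selected = []
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # no timestamp on this line
        try:
            ts = datetime.strptime(m.group(0), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # looked like a timestamp but is not a valid date
        if lo <= ts <= hi:
            selected.append(line.rstrip("\n"))
    return selected
```

For the sample above you would pass center=datetime(2019, 10, 30, 15, 28, 32). Keep in mind that the clocks on the two hosts may not be perfectly in sync, so a wider window can be safer.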

-Adam