Hi team,
Our SingleStore cluster is deployed in an OCP environment, using an ICOS bucket for bottomless storage.
Recently we have seen the following error:
Leaf Error (node-memsql-cluster-leaf-ag2-9.svc-memsql-cluster:3306): Failed to synchronize database. SYNC POSITION timed out waiting for replay.
I checked the recovery status for this leaf and this database, and the recovery does not appear to be progressing. For example, partition coedev_106 is stuck at 14.9% replay completion, and its ETA keeps increasing over time, now to more than 20 days.
DATABASE_NAME,DATABASE_ID,REPLAY_ETA,REPLAY_STATUS,REPLAYED_LSN,STARTUP_COMMITED_LSN,RECOVERY_TYPE
coedev_106,48311,"14.9% done, ETA 22 days 21 hours 52 minutes 35 seconds",Replaying log files downloaded from master,45265414,27983872,Restart
coedev_131,48138,"16.0% done, ETA 21 days 2 hours 34 minutes 51 seconds",Replaying log files downloaded from master,45334751,28077843,Restart
coedev_80,39340,"16.3% done, ETA 20 days 12 hours 28 minutes 57 seconds",Replaying log files downloaded from master,45327153,28079714,Restart
coedev_120,48313,"16.3% done, ETA 20 days 13 hours 27 minutes 22 seconds",Replaying log files downloaded from master,45320522,28040183,Restart
coedev_95,48312,"21.1% done, ETA 14 days 22 hours 27 minutes 45 seconds",Replaying log files downloaded from master,45160754,27887766,Restart
coedev_92,39341,"21.9% done, ETA 14 days 6 hours 14 minutes 27 seconds",Replaying log files downloaded from master,45369935,28098154,Restart
coedev_127,48137,"22.0% done, ETA 14 days 4 hours 34 minutes 20 seconds",Replaying log files downloaded from master,45116522,27852800,Restart
coedev_101,48030,"23.8% done, ETA 18 days 4 hours 28 minutes 26 seconds",Replaying log files downloaded from master,45171119,23033073,Restart
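To tell a truly stuck replay from a very slow one, two snapshots of the output above, taken some minutes apart, can be compared on REPLAYED_LSN. Below is a minimal Python sketch of that check, assuming each snapshot is saved as CSV text with the same header row as above. The function names `parse_snapshot` and `stalled_partitions` are hypothetical helpers written for this post, not part of any SingleStore tooling.

```python
import csv
import io


def parse_snapshot(text):
    """Map DATABASE_NAME -> REPLAYED_LSN (int) for one CSV snapshot."""
    # Text copied from a web page may carry curly quotes; convert them back
    # to straight quotes so the comma inside REPLAY_ETA stays in one field.
    cleaned = text.replace("\u201c", '"').replace("\u201d", '"')
    reader = csv.DictReader(io.StringIO(cleaned))
    return {row["DATABASE_NAME"]: int(row["REPLAYED_LSN"]) for row in reader}


def stalled_partitions(earlier, later):
    """Return names of partitions whose REPLAYED_LSN did not advance."""
    before = parse_snapshot(earlier)
    after = parse_snapshot(later)
    return sorted(
        name
        for name, lsn in after.items()
        if name in before and lsn <= before[name]
    )
```

Calling `stalled_partitions(first, second)` with the two snapshot strings returns the partitions whose REPLAYED_LSN stayed the same. An empty list would suggest replay is advancing slowly rather than being stuck.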
Can anyone explain why this is happening and how we can push the recovery process to complete?