The recovery process seems to be stuck

Hi team,
Our SingleStore cluster is deployed in an OCP environment, using an ICOS bucket as bottomless storage.
Recently we have seen the following error:
Leaf Error (node-memsql-cluster-leaf-ag2-9.svc-memsql-cluster:3306): Failed to synchronize database. SYNC POSITION timed out waiting for replay.

I have checked the recovery status on this leaf for this database, and it does not seem to make any progress. For example, partition coedev_106 has been stuck at 14.9% replay completion, and the ETA keeps growing over time to more than 20 days.

DATABASE_NAME,DATABASE_ID,REPLAY_ETA,REPLAY_STATUS,REPLAYED_LSN,STARTUP_COMMITED_LSN,RECOVERY_TYPE
coedev_106,48311,"14.9% done, ETA 22 days 21 hours 52 minutes 35 seconds",Replaying log files downloaded from master,45265414,27983872,Restart
coedev_131,48138,"16.0% done, ETA 21 days 2 hours 34 minutes 51 seconds",Replaying log files downloaded from master,45334751,28077843,Restart
coedev_80,39340,"16.3% done, ETA 20 days 12 hours 28 minutes 57 seconds",Replaying log files downloaded from master,45327153,28079714,Restart
coedev_120,48313,"16.3% done, ETA 20 days 13 hours 27 minutes 22 seconds",Replaying log files downloaded from master,45320522,28040183,Restart
coedev_95,48312,"21.1% done, ETA 14 days 22 hours 27 minutes 45 seconds",Replaying log files downloaded from master,45160754,27887766,Restart
coedev_92,39341,"21.9% done, ETA 14 days 6 hours 14 minutes 27 seconds",Replaying log files downloaded from master,45369935,28098154,Restart
coedev_127,48137,"22.0% done, ETA 14 days 4 hours 34 minutes 20 seconds",Replaying log files downloaded from master,45116522,27852800,Restart
coedev_101,48030,"23.8% done, ETA 18 days 4 hours 28 minutes 26 seconds",Replaying log files downloaded from master,45171119,23033073,Restart

Can anyone explain what is going on and how we can push the recovery process to completion?

Thank you for reaching out to us. The “Failed to synchronize database. SYNC POSITION timed out waiting for replay” error is typically caused by insufficient resources to run the replay process. However, we’ll need more information to diagnose the issue.
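As a quick first check on your side, it is worth confirming that the leaf pod is not starved for CPU or memory while replay is running. On OCP, something like the following should work (a rough sketch, assuming cluster metrics are available, that the pod name matches the leaf host from the error message, and that <namespace> is the project the pods run in):

oc adm top pod node-memsql-cluster-leaf-ag2-9 -n <namespace>
oc describe pod node-memsql-cluster-leaf-ag2-9 -n <namespace> | grep -A 4 Limits

If the pod is running close to its CPU or memory limits, replay can slow to a crawl, which would match the ever-growing ETAs you are seeing.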

Could you please provide the output of mv_cluster_status and mv_bt, or a cluster report, so our engineering team can investigate further?
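For example, from any aggregator (assuming you can connect with a SQL client):

SELECT * FROM information_schema.mv_cluster_status;
SELECT * FROM information_schema.mv_bt;

If SingleStore Toolbox can reach the cluster in your OCP setup, you can also generate a full cluster report with sdb-report collect and attach the report file it produces.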

One more thing you can try is restarting the affected leaf node. This can sometimes resolve ICOS-related slowdowns or object storage access issues, which may help unblock the replay.
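In an operator-managed OCP deployment, the usual way to restart a single leaf is to delete its pod and let the operator/StatefulSet recreate it, for example (again assuming the pod name matches the leaf host from the error and <namespace> is your project):

oc delete pod node-memsql-cluster-leaf-ag2-9 -n <namespace>

Once the pod is back up, keep an eye on the replay percentages to see whether they start moving again.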