Hey! Last night I ran memsql-admin register-node -y --json --force ... and then memsql-admin start-node -y --memsql-id <ID> against a leaf node (from the master aggregator node, of course). When I woke up the leaf was still in the “Recovering” state, so I called start-node again and waited about 10 minutes. Here’s what memsql-admin list-nodes shows me:
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID | Role | Host | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 967171DBBC | Master | 10.0.45.3 | 3306 | Running | True | 6.8.11 | Online | | 0.0.0.0 |
| 101B4127FD | Leaf | 10.0.39.177 | 3306 | Running | True | 6.8.11 | Recovering | 1 | 0.0.0.0 |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
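In case it’s useful, I can run the following from a SQL prompt on the master aggregator (and the last one on the leaf itself) and paste the output — I’m assuming SHOW LEAVES / SHOW CLUSTER STATUS / SHOW DATABASES EXTENDED are the right things to look at for per-database recovery, so correct me if not:

-- how the master aggregator currently sees each leaf
SHOW LEAVES;

-- per-node, per-database replication/recovery status across the cluster
SHOW CLUSTER STATUS;

-- run on the leaf itself: state of each database hosted on that node
SHOW DATABASES EXTENDED;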
And here’s what sudo tail -n 100 /mnt/raid/memsql/tracelogs/memsql.log shows me so far:
133270214 2019-10-31 15:02:19.696 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
139790686 2019-10-31 15:02:26.217 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`ivt`
143277894 2019-10-31 15:02:29.704 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
153284812 2019-10-31 15:02:39.711 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
159675906 2019-10-31 15:02:46.102 INFO: Recovering secondary key `deft_1`.`datoms_string`.`eavt`
162415437 2019-10-31 15:02:48.841 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`eavt`
163297285 2019-10-31 15:02:49.723 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
166546699 2019-10-31 15:02:52.973 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`avt`
173305009 2019-10-31 15:02:59.731 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
174770068 2019-10-31 15:03:01.196 INFO: Recovering secondary key `deft_1`.`datoms_string`.`avt`
180796892 2019-10-31 15:03:07.223 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`vt`
183314803 2019-10-31 15:03:09.741 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
193329751 2019-10-31 15:03:19.756 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
203011969 2019-10-31 15:03:29.438 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`tt`
203347296 2019-10-31 15:03:29.773 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
208494557 2019-10-31 15:03:34.921 INFO: Recovering secondary key `deft_1`.`datoms_string`.`vt`
208868212 2019-10-31 15:03:35.294 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`p`
212574665 2019-10-31 15:03:39.001 INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_short`
212574729 2019-10-31 15:03:39.001 INFO: Recovering secondary key `deft_2`.`datoms_short`.`ieavt`
212575379 2019-10-31 15:03:39.001 INFO: Recovering secondary key `deft_2`.`datoms_short`.`iavt`
212575804 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`ivt`
212576035 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`eavt`
212576386 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`avt`
212576803 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`vt`
212577053 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`tt`
212577469 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`p`
212577863 2019-10-31 15:03:39.004 INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_string`
212765739 2019-10-31 15:03:39.192 INFO: Recovering secondary key `deft_2`.`datoms_string`.`ieavt`
213354939 2019-10-31 15:03:39.781 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
220860837 2019-10-31 15:03:47.287 INFO: Recovering secondary key `deft_2`.`datoms_string`.`iavt`
223367782 2019-10-31 15:03:49.794 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
233382269 2019-10-31 15:03:59.808 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
243391441 2019-10-31 15:04:09.817 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
245806903 2019-10-31 15:04:12.233 INFO: Recovering secondary key `deft_2`.`datoms_string`.`ivt`
253402606 2019-10-31 15:04:19.829 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
…
295026679 2019-10-31 15:05:01.453 INFO: Clearing snapshot version for database `deft_1` as it is about to go online.
295062983 2019-10-31 15:05:01.489 INFO: RecoverDatabaseThread[deft_1]: End
295483958 2019-10-31 15:05:01.910 INFO: STOP REPLICATING deft_1
295484017 2019-10-31 15:05:01.910 INFO: Thread 115083: StopSlaveThreads: Stopping slave thread for database `deft_1`.
295484070 2019-10-31 15:05:01.910 INFO: Completed STOP REPLICATING deft_1
295484093 2019-10-31 15:05:01.910 INFO: DB deft_1: has original Id 11395525453320891103.
295484104 2019-10-31 15:05:01.910 INFO: PROMOTE ALL LOCKS RELEASED deft_1
303494218 2019-10-31 15:05:09.920 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4
…
423633071 2019-10-31 15:07:10.059 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4
426466282 2019-10-31 15:07:12.892 INFO: Recovering secondary key `deft_2`.`datoms_string`.`p`
430241989 2019-10-31 15:07:16.668 INFO: [deft_2] Scanning columnstore segments: Completed, elapsed: 0 ms, disk: 0 ms
430242484 2019-10-31 15:07:16.668 INFO: Clearing snapshot version for database `deft_2` as it is about to go online.
430279079 2019-10-31 15:07:16.705 INFO: RecoverDatabaseThread[deft_2]: End
430279189 2019-10-31 15:07:16.705 INFO: Completed recovery of user databases.
430640805 2019-10-31 15:07:17.067 INFO: STOP REPLICATING deft_2
430640881 2019-10-31 15:07:17.067 INFO: Thread 115080: StopSlaveThreads: Stopping slave thread for database `deft_2`.
430640930 2019-10-31 15:07:17.067 INFO: Completed STOP REPLICATING deft_2
430640977 2019-10-31 15:07:17.067 INFO: DB deft_2: has original Id 11395525453320891103.
430640995 2019-10-31 15:07:17.067 INFO: PROMOTE ALL LOCKS RELEASED deft_2
The logs do show repeated errors, but it looks like there has been some forward progress. However, memsql-admin list-nodes still shows the leaf as “Recovering”, and it’s been 8 minutes since anything was written to the log. I have a pretty grim outlook at the moment, since my first thought is that the data is corrupted. Luckily I do have backups, though they may have been taken slightly after the last known good state; we’ll see.
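If it would help narrow things down, I can also check the partition map for the two databases from the master aggregator, e.g. (deft_1 and deft_2 are my database names; I’m not sure whether this view is authoritative while a leaf is still recovering):

SHOW PARTITIONS ON deft_1;
SHOW PARTITIONS ON deft_2;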
Any insight you can bring to bear on this would be extremely appreciated since we’re dead in the water till we fix this! Thanks!
That does make me feel better. It’s still strange that it’s stuck in the Recovering state, though. The latest output of the master aggregator log is a bunch of statistics queries, and just before that it says there is an availability group with no leaves. I have an urgent meeting, so I’ll have to get back to you with more specifics on that within an hour or so.
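When I’m out of the meeting I’ll grab the relevant lines from the master aggregator’s trace log, probably with something along these lines (the path is a guess based on where the leaf’s log lives on this cluster; the master aggregator box may keep it somewhere else):

sudo tail -n 200 /mnt/raid/memsql/tracelogs/memsql.log
sudo grep -i "availability group" /mnt/raid/memsql/tracelogs/memsql.log | tail -n 20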