MemSQL DB crashes (cluster in a box) 7.1.7

It’s happening very often that when we run heavy queries, the database (and the server) simply crashes. I have to restart the server, and after recovery it works again.

I set a resource pool for all users with a maximum of 70% of memory, a timeout of 600 seconds, a concurrency of 5, and a queue depth of 15, but it is still happening.
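For reference, the pool is defined roughly like this (a sketch with placeholder pool/user names; the options mirror what I described above, but double-check the CREATE RESOURCE POOL syntax for your version):

CREATE RESOURCE POOL app_pool WITH
  MEMORY_PERCENTAGE = 70,   -- cap the pool at 70% of node memory
  QUERY_TIMEOUT = 600,      -- kill queries after 600 seconds
  MAX_CONCURRENCY = 5,      -- at most 5 queries running at once
  MAX_QUEUE_DEPTH = 15;     -- up to 15 queries may wait in the queue
ALTER USER 'app_user' SET RESOURCE POOL = app_pool;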

These are the last logs that I see in the master:
FROM view_query’ submitted 419 milliseconds ago, queued for 16 milliseconds, compiled asynchronously in 403 milliseconds
6754790299059 2020-12-10 22:43:45.907 WARN: socket (153) ETIMEDOUT in recv
6754791258972 2020-12-10 22:43:46.867 INFO: Background Statistics Thread: Writing stats
6754792347109 2020-12-10 22:43:47.955 WARN: socket (145) ETIMEDOUT in recv
6754798491146 2020-12-10 22:43:54.099 WARN: socket (149) ETIMEDOUT in recv
6754806687082 2020-12-10 22:44:02.295 WARN: socket (152) ETIMEDOUT in recv
6754818971077 2020-12-10 22:44:14.579 WARN: socket (155) ETIMEDOUT in recv
6754921371092 2020-12-10 22:45:56.979 WARN: socket (108) ETIMEDOUT in recv
6754935707103 2020-12-10 22:46:11.315 WARN: socket (101) ETIMEDOUT in recv

Any clue? Is this a config problem?

I don’t have an answer for you off the top of my head. If you’re a paying customer, opening a support case might be in order. I’ll ask to see if someone who knows more can help.

We’re up to 7.1.13 now, so you might try upgrading. But I don’t have specific knowledge of any bug fixes that would impact this.


If the server crashed, you should see crash-reporting output in the tracelog (it will dump a call stack of the crash). I don’t see that in the snippet you pasted, and if you don’t see that output it likely wasn’t a crash. The most common reason for the process to die without a crash is the Linux OOM killer killing it because memory limits are not configured properly (see the SingleStore documentation on configuring memory limits).
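If you want to confirm it was the OOM killer, checking the kernel log is usually enough (standard Linux commands, nothing MemSQL-specific):

dmesg -T | grep -i 'out of memory'
# or, on systemd-based hosts:
journalctl -k | grep -i oom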

If you want us to look in more detail, send a cluster report to bug-report@memsql.com.

Thank you, I think I got it. I checked the kernel logs and you’re right… it’s an OOM issue.

I think I’m going to reduce the maximum_memory global parameter. Currently, it’s 90% of the server RAM, but I will reduce it a little more.

Dec 10 21:46:56 memsql kernel: [9248475.324923] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/memsql.service,task=memsqld,pid=1359,uid=112
Dec 10 21:46:56 memsql kernel: [9248475.326620] Out of memory: Killed process 1359 (memsqld) total-vm:30179084kB, anon-rss:25677544kB, file-rss:0kB, shmem-rss:0kB
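In case it is useful to someone else, this is roughly how I plan to check and lower the limit (the 24000 MB value is just an example for my box; if your version does not allow changing maximum_memory at runtime, it can instead be set through the management tools or the node config):

SHOW GLOBAL VARIABLES LIKE 'maximum_memory';   -- current limit, in MB
SET GLOBAL maximum_memory = 24000;             -- example: lower the cap to ~24 GB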

Yes, that is one way to resolve it.

Another is to ensure you have enough swap space set up. Linux is much more trigger-happy with OOM kills if there is not enough swap (we recommend 10-20% of physical memory).
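If it helps, a minimal sketch of adding a swap file on a typical Linux host (the 4G size is only an example; size it to roughly 10-20% of your physical memory):

sudo fallocate -l 4G /swapfile                               # allocate the swap file
sudo chmod 600 /swapfile                                     # permissions required by swapon
sudo mkswap /swapfile                                        # format it as swap
sudo swapon /swapfile                                        # enable it now
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots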