MemSQL on K8S - WARN: Failed to free X bytes of memory at address Y to kernel. Error 12: Cannot allocate memory

Hi Team,

I deployed MemSQL on Azure Kubernetes. After running several queries successfully, I noticed that memory was not being released, and the last query failed.
Logs:

node 2020-09-02 14:09:35.786   WARN: Failed to free 131072 bytes of memory at address 0x7fafeade1000 to kernel. Error 12: Cannot allocate memory
node 2020-09-02 14:09:35.786   WARN: Failed to free 131072 bytes of memory at address 0x7fafeb5c1000 to kernel. Error 12: Cannot allocate memory
node 2020-09-02 14:09:35.786   WARN: Failed to free 131072 bytes of memory at address 0x7fafebd81000 to kernel. Error 12: Cannot allocate memory
node 2020-09-02 14:09:35.786   WARN: Failed to free 131072 bytes of memory at address 0x7fafec5a1000 to kernel. Error 12: Cannot allocate memory
node 2020-09-02 14:09:35.786   WARN: Failed to free 131072 bytes of memory at address 0x7fafecda1000 to kernel. Error 12: Cannot allocate memory
node 2020-09-02 14:44:22.576   WARN: [33 messages supressed Partition metadata is out of sync on table `testdb__validation_6`.`testtable`. Execution will continue but codegen will be slowed
node 2020-09-02 14:44:26.729   WARN: Failed to allocate 4096 bytes of memory from the operating system (Error 12: Cannot allocate memory). This is usually due to a misconfigured operating system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.
node 2020-09-02 14:44:26.729  ERROR: Failure to allocate IFN thunk page
node 2020-09-02 14:44:26.729  ERROR: Nonfatal buffer manager memory allocation failure.
node 2020-09-02 14:44:26.729  ERROR: Threads_cached :  139
node 2020-09-02 14:44:26.729  ERROR: Threads_connected :  172
node 2020-09-02 14:44:26.729  ERROR: Threads_created :  199
node 2020-09-02 14:44:26.729  ERROR: Threads_running :  1
node 2020-09-02 14:44:26.729  ERROR: Threads_background :  1
node 2020-09-02 14:44:26.729  ERROR: Threads_idle :  112
node 2020-09-02 14:44:26.729  ERROR: Ready_queue :  0
node 2020-09-02 14:44:26.729  ERROR: Idle_queue :  0
node 2020-09-02 14:44:26.729  ERROR: Context_switches :  1506
node 2020-09-02 14:44:26.729  ERROR: Context_switch_misses :  0
node 2020-09-02 14:44:26.729  ERROR: Columnstore_ingest_management_estimated_segments_to_flush :  0
node 2020-09-02 14:44:26.729  ERROR: Columnstore_ingest_management_estimated_memory :  0.000 MB
node 2020-09-02 14:44:26.729  ERROR: Threads_waiting_for_disk_space :  0
node 2020-09-02 14:44:26.729  ERROR: Total_server_memory :  27076.9 (+107.6) MB
node 2020-09-02 14:44:26.729  ERROR: Total_io_pool_memory :  7.9 MB
node 2020-09-02 14:44:26.729  ERROR: Free_io_pool_memory :  0.0 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_thread_stacks :  200.000 MB
node 2020-09-02 14:44:26.730  ERROR: Malloc_active_memory :  619.462 (+52.228) MB
node 2020-09-02 14:44:26.730  ERROR: Malloc_transaction_cached_memory :  267.883 MB
node 2020-09-02 14:44:26.730  ERROR: Linux_resident_memory :  22581.028 (+184.524) MB
node 2020-09-02 14:44:26.730  ERROR: Linux_resident_shared_memory :  111.500 (+5.000) MB
node 2020-09-02 14:44:26.730  ERROR: Buffer_manager_memory :  25756.8 MB
node 2020-09-02 14:44:26.730  ERROR: Buffer_manager_cached_memory :  23202.1 (-386.9) MB
node 2020-09-02 14:44:26.730  ERROR: Buffer_manager_unrecycled_memory :  7.9 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_skiplist_tower :  127.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable :  46.250 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_table_primary :  100.625 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_deleted_version :  106.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_internal_key_node :  37.625 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_hash_buckets :  185.034 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_table_metadata_cache :  1.250 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_code_generator :  404.271 (+404.271) MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_unit_images :  125.823 (+55.414) MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_unit_ifn_thunks :  1.715 (+0.008) MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_object_code_images :  31.072 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_compiled_unit_sections :  19.571 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_databases_list_entry :  1.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_plan_cache :  3.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_warnings :  14.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_replication :  1.125 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_sharding_partitions :  0.125 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_mmap_file :  80.000 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_protocol_packet :  21.375 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_large_incremental :  17.521 (+17.521) MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_table_autostats :  129.601 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_system_tasks :  0.125 (+0.125) MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_table_memory :  732.134 MB
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_16 :  allocs:139991  alloc_MB:2.1  buffer_MB:3.4  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_24 :  allocs:37920  alloc_MB:0.9  buffer_MB:1.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_32 :  allocs:14096  alloc_MB:0.4  buffer_MB:2.2  cached_buffer_MB:1.5  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_40 :  allocs:1301  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_48 :  allocs:1758  alloc_MB:0.1  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_56 :  allocs:273  alloc_MB:0.0  buffer_MB:0.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_64 :  allocs:70  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_72 :  allocs:2881  alloc_MB:0.2  buffer_MB:0.4  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_80 :  allocs:820  alloc_MB:0.1  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_88 :  allocs:210  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_104 :  allocs:229  alloc_MB:0.0  buffer_MB:0.4  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_128 :  allocs:2474  alloc_MB:0.3  buffer_MB:2.8  cached_buffer_MB:1.1  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_160 :  allocs:1999  alloc_MB:0.3  buffer_MB:0.4  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_200 :  allocs:58  alloc_MB:0.0  buffer_MB:0.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_248 :  allocs:24  alloc_MB:0.0  buffer_MB:0.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_312 :  allocs:1047  alloc_MB:0.3  buffer_MB:7.8  cached_buffer_MB:1.9  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_384 :  allocs:7  alloc_MB:0.0  buffer_MB:2.1  cached_buffer_MB:1.9  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_480 :  allocs:122  alloc_MB:0.1  buffer_MB:11.0  cached_buffer_MB:1.9  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_600 :  allocs:20  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_752 :  allocs:5  alloc_MB:0.0  buffer_MB:2.0  cached_buffer_MB:1.9  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_936 :  allocs:12  alloc_MB:0.0  buffer_MB:0.6  cached_buffer_MB:0.5  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_1168 :  allocs:6  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_1480 :  allocs:6  alloc_MB:0.0  buffer_MB:1.5  cached_buffer_MB:1.4  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_1832 :  allocs:6  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_2288 :  allocs:0  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.1  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_2832 :  allocs:28  alloc_MB:0.1  buffer_MB:0.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.730  ERROR: Alloc_variable_bucket_3528 :  allocs:6  alloc_MB:0.0  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_4504 :  allocs:9  alloc_MB:0.0  buffer_MB:0.8  cached_buffer_MB:0.6  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_5680 :  allocs:15  alloc_MB:0.1  buffer_MB:0.2  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_6224 :  allocs:3  alloc_MB:0.0  buffer_MB:0.6  cached_buffer_MB:0.4  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_7264 :  allocs:0  alloc_MB:0.0  buffer_MB:0.4  cached_buffer_MB:0.4  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_9344 :  allocs:122  alloc_MB:1.1  buffer_MB:1.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_65472 :  allocs:1  alloc_MB:0.1  buffer_MB:0.1  cached_buffer_MB:0.0  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_bucket_130960 :  allocs:25  alloc_MB:3.1  buffer_MB:5.0  cached_buffer_MB:1.9  
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_cached_buffers :  15.4 (+0.1) MB
node 2020-09-02 14:44:26.731  ERROR: Alloc_variable_allocated :  9.4 MB
node 2020-09-02 14:44:26.731  ERROR: GCed_versions_last_sweep :  0
node 2020-09-02 14:44:26.731  ERROR: Average_garbage_collection_duration :  14 ms

K8S agentpools:

  • memsql - 8 cores and 32gb (master)
  • memsqlleaf - 32 cores and 128gb (leaf)

‘MemsqlCluster’ K8S Deployment object:

kind: MemsqlCluster
metadata:
  name: memsql-cluster
spec:
  license: xyz
  adminHashedPassword: "xyz"
  nodeImage:
    repository: memsql/node
    tag: centos-7.1.7-27ea2acf75

  redundancyLevel: 2

  monitoringSpec:
    memsqlPusherSpec:
      enable: true
      mode: Cluster
      organizationName: cv
      kafkaBootstrapServer: kafka.monitoring.svc.cluster.local:9092
      kafkaProtocol: plaintext
      pusherSecretName: pushersecret

  serviceSpec:
    objectMetaOverrides:
      labels:
        custom: label
      annotations:
        service.beta.kubernetes.io/azure-load-balancer-internal: "true"

  aggregatorSpec:
    count: 1
    height: 0.8
    storageGB: 32
    storageClass: managed-premium

    objectMetaOverrides:
      annotations:
        optional: annotation
      labels:
        optional: label

  leafSpec:
    count: 2
    height: 3.6
    storageGB: 1024
    storageClass: managed-premium

    objectMetaOverrides:
      annotations:
        optional: annotation
      labels:
        optional: label

  schedulingDetails:
    master:
      nodeSelector:
        agentpool: memsql
    aggregator:
      nodeSelector:
        agentpool: memsql
    leaf-ag1:
      nodeSelector:
        agentpool: memsqlleaf
    leaf-ag2:
      nodeSelector:
        agentpool: memsqlleaf

Please advise,

Chen

The pod also crashed:

node 2020-09-03 09:58:53.847   WARN: Failed to allocate 13750272 bytes of memory from the operating system (Error 12: Cannot allocate memory). This is usually due to a misconfigured operating system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.
node 2020-09-03 09:58:53.847   FAIL: Failure to allocate code section
node +-----------------------------------------------------------------------------+
node | MemSQL has encountered a fatal error and exited.                            |
node | It could be a bug, misconfiguration, or hardware issue.                     |
node +-----------------------------------------------------------------------------+
node |                                                                             |
node | When you restart MemSQL, your data will be recovered back                   |
node | to a consistent state.                                                      |
node |                                                                             |
node | Since the 'core-file' setting is enabled a core file will now be            |
node | generated at:                                                               |
node |                                                                             |
node | /var/lib/memsql/instance/data/core                                          |
node |                                                                             |
node |                                                                             |
node | Please contact MemSQL technical support.                                    |
node |                                                                             |
node +-----------------------------------------------------------------------------+
node [memsqld_safe 2020/09/03 09:58:55 process exited with exit code -1
node stream closed

Hi @chen,

From the error output, it appears that the host system hasn’t been configured for MemSQL. We provide instructions and guidance on how to configure the hosts in our online docs.

What’s most critical is configuring the kernel and host environment settings for the following:

Configure Linux vm settings

vm.max_map_count=1000000000
vm.min_free_kbytes=658096
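
To see whether a given worker node already meets these values, a small check along the following lines may help. It is only a sketch: the function names are made up for illustration, it assumes the two values above are minimums (so a higher current value passes), and it reads the standard /proc/sys files rather than calling any MemSQL tooling.

```python
from pathlib import Path

# Values quoted in the reply above; treated here as minimums (an assumption).
RECOMMENDED = {
    "vm.max_map_count": 1000000000,
    "vm.min_free_kbytes": 658096,
}


def read_sysctl(key, proc_root="/proc/sys"):
    """Return the integer value of a sysctl key, or None if it cannot be read."""
    path = Path(proc_root) / key.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def check_vm_settings(recommended=RECOMMENDED, proc_root="/proc/sys"):
    """Map each key to (current, wanted, ok), where ok means current >= wanted."""
    report = {}
    for key, wanted in recommended.items():
        current = read_sysctl(key, proc_root)
        ok = current is not None and current >= wanted
        report[key] = (current, wanted, ok)
    return report


if __name__ == "__main__":
    for key, (current, wanted, ok) in check_vm_settings().items():
        status = "OK" if ok else "LOW"
        print(f"{key}: current={current} wanted>={wanted} {status}")
```

Run it on the worker node itself (or in a pod that can see the host's /proc/sys), since these are kernel settings of the node rather than of the container image.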

Configure File Descriptor and Maximum Process Limits

memsql    soft    NOFILE    1024000
memsql    hard    NOFILE    1024000
memsql    soft    nproc     128000
memsql    hard    nproc     128000
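
The limits that actually apply to a running process can be read back with Python's standard resource module. The sketch below is illustrative only: the helper names are hypothetical, it reports the limits of the process that runs it, and it assumes the values above are minimums.

```python
import resource

# Values quoted in the reply above; treated here as minimums (an assumption).
WANTED = {
    "nofile": (resource.RLIMIT_NOFILE, 1024000),
    "nproc": (resource.RLIMIT_NPROC, 128000),
}


def limit_ok(current, wanted):
    """True if current meets wanted; RLIM_INFINITY counts as unlimited."""
    return current == resource.RLIM_INFINITY or current >= wanted


def check_limits(wanted=WANTED):
    """Map each limit name to (soft, hard, ok) for the current process."""
    report = {}
    for name, (rlimit, minimum) in wanted.items():
        soft, hard = resource.getrlimit(rlimit)
        ok = limit_ok(soft, minimum) and limit_ok(hard, minimum)
        report[name] = (soft, hard, ok)
    return report


if __name__ == "__main__":
    for name, (soft, hard, ok) in check_limits().items():
        print(f"{name}: soft={soft} hard={hard} {'OK' if ok else 'LOW'}")
```

For the result to be meaningful it should be run as the same user, and in the same environment, as the MemSQL process, because limits are inherited per process and can differ between a login shell and a container.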

These settings must be applied on each Kubernetes worker node where MemSQL will run.