ER_CANNOT_CONNECT_TO_LEAF: leaf error / with user distributed using password: YES

I have self-hosted two aggregators and two leaves running in replication mode. I get this error a lot and constantly have to restart memsql on one of the leaves: ERROR 1735 ER_CANNOT_CONNECT_TO_LEAF: Leaf Error (everclear4.XXXXXXXXXX:3306): Cannot connect to node @everclear4.XXXXXXXX:3306 with user distributed using password: YES [1045] Access denied for user 'distributed'@'IPADDRESS' (using password: YES). The remedy is to restart memsql.service on that leaf.
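For context, before restarting anything I can check from the master aggregator whether that leaf is actually marked offline or detached when the error fires. A minimal check with standard commands; the host name below is a placeholder for the redacted one:

```sql
-- List all leaves with their current state (online, offline, attaching, ...).
SHOW LEAVES;

-- Same information from information_schema, filtered to the problem leaf
-- ('leaf-host' is a placeholder for the real hostname).
SELECT * FROM information_schema.LEAVES WHERE HOST = 'leaf-host';
```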

I backed up the table structure, dropped all the tables and this leaf, then re-created the leaf with sdb-admin add-leaf, and I still have the same symptoms.
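For reference, the remove/re-add cycle can also be driven from SQL on the master aggregator instead of sdb-admin; a rough sketch, with host, port, database name, and credentials all placeholders:

```sql
-- Detach the problem leaf from the cluster.
REMOVE LEAF 'leaf-host':3306;

-- Re-add it once the node is healthy again (user and password are placeholders).
ADD LEAF root:'leaf_root_password'@'leaf-host':3306;

-- Move partitions back onto the re-added leaf ('your_db' is a placeholder).
REBALANCE PARTITIONS ON your_db;
```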

I asked SQrL about this, and here is the response:

There could be several potential causes for an ER_CANNOT_CONNECT_TO_LEAF error [1]. The error indicates that an aggregator node is unable to connect to a leaf node [2].

One likely scenario is that when leaves connect to each other, they cache connection information (e.g. leaf-1 is at IP x.x.x.x, leaf-2 is at IP y.y.y.y, etc.). If the IPs of these leaves ever change, the cache will not update automatically, and connection attempts will ultimately fail because the other leaves in the cluster are using outdated IP address information [1].

To resolve this issue, you can flush the DNS cache and connection pools on all affected nodes with the following SQL commands [1]:

```sql
FLUSH HOSTS;
FLUSH CONNECTION POOLS;
```

FLUSH HOSTS clears the DNS cache on the node, and FLUSH CONNECTION POOLS shuts down all existing connections and closes idle pooled connections [1]. Run these commands on all the affected nodes.

Ensure that there is network connectivity between the aggregator and leaf nodes and that no firewalls or security group settings are blocking the connections [1].

If the problem persists, it may be because a capacity limit is being hit. As CPU maxes out, queries start taking longer to complete as the kernel queues them. You may need to tune max_connection_threads to find the right balance between the number of concurrently running queries and resource utilization. As you increase the maximum threads per aggregator, monitor resource usage (CPU, load average, memory) during a typical workload [3].
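To inspect and trial-adjust this setting on an aggregator, something like the following can be used; whether SET GLOBAL is accepted for this variable depends on the engine version (sdb-admin update-config is the alternative), and the value 192 below is only an illustration, not a recommendation:

```sql
-- Current connection-thread ceiling on this aggregator.
SHOW VARIABLES LIKE 'max_connection_threads';

-- Trial value while watching CPU, load average, and memory under a typical
-- workload; 192 is an arbitrary example, not a recommendation.
SET GLOBAL max_connection_threads = 192;
```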

These are some general remedies; however, specific problems may require specific solutions, so you might need to evaluate your system's logs and conditions more thoroughly [1].

This error can also occur in the SingleStoreDB managed service when leaves cannot connect to S3, for instance due to a 'dial tcp i/o timeout' error. That may require checking firewall and network routing configurations [4].

If you constantly need to restart a specific leaf in your cluster, though, and the problem keeps recurring, there may be a larger issue at hand, and you might want to discuss it with the SingleStoreDB support team.

Make sure to follow these steps and perform the analysis in a systematic manner.

That is an interesting suggestion. No IP addresses have changed at all, but the cluster does rely on DNS records to resolve the node names to IPs. My question for the forum is: how can I inspect the records in the DNS cache before flushing them, so I can see what is going on and troubleshoot this further?

Does anyone have any other suggestions about why this could be happening? The error hardly ever shows up on the forum at all, yet in my cluster it is happening very frequently, not just once a day or so. Thanks.

See if you can correlate the error with the state of the leaf, and whether it occurs consistently on the same leaf. If something is causing the leaf to max out its CPU and memory, the leaf would be moved offline, allowed to recover, and brought back online.

If there is a correlation, see if you can nail it down to an offending process or query that is causing the leaf to go offline.
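A starting point for that, run from the master aggregator; the information_schema view and column names below are assumptions and can vary between engine versions:

```sql
-- What is running cluster-wide when the leaf starts struggling.
SHOW PROCESSLIST;

-- Cumulative resource use by query shape, to spot a candidate offender
-- (mv_activities_cumulative and its columns are assumed here and may differ
-- between versions).
SELECT activity_name, database_name, cpu_time_ms, elapsed_time_ms
FROM information_schema.mv_activities_cumulative
ORDER BY cpu_time_ms DESC
LIMIT 10;
```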

Also check the skew of the data, as skew can stress the resources on a particular leaf and produce the same maxing out of CPU and memory.
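One way to quantify skew, assuming information_schema.table_statistics (per-partition row counts) is available in your version; treat it as a sketch:

```sql
-- Spread of rows across partitions per table; a large gap between min and max
-- suggests skew concentrating work on one leaf.
SELECT database_name,
       table_name,
       FLOOR(AVG(rows)) AS avg_rows_per_partition,
       MIN(rows)        AS min_rows,
       MAX(rows)        AS max_rows
FROM information_schema.table_statistics
GROUP BY database_name, table_name
ORDER BY (MAX(rows) - MIN(rows)) DESC;
```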

Also consider a problem with the underlying compute specific to that leaf that results in intermittent connectivity. Try replacing or moving the VM to rule that out.