DR Replication Bandwidth Usage

Good Morning,

We have been experiencing some irregular bandwidth usage when using traditional master-slave replication within MemSQL.

Here is the scenario:
We have one Cluster (1 Master agg, 1 Child agg and 8 leafs in HA) in our own private cloud. We set up replication to our DR site in AWS (1 Master agg, 1 Child agg and 6 leafs in HA).

The database we are replicating has the following data:
Memory usage: 46.89 GB
Columnstore disk usage: 36.45 GB

We then set up replication to our AWS DR cluster with the "REPLICATE DATABASE {databasename} FROM " command. It took some time for all the leafs to replicate, but once complete it was working fine.
After about a week we looked at our bandwidth reports and noticed that we were pushing over 800Gb of data per leaf, per day, to our new AWS cluster. That is about 8 times the size of our entire cluster, per leaf.
The reports are quite clear that it was traffic on 3306 which indicates that it is the MemSQL traffic rather than any other traffic.
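
As a rough sanity check on these figures, the sketch below compares the reported daily per-leaf traffic with the combined memory and columnstore size of the database. Whether "Gb" means gigabytes or gigabits is an assumption either way, so both readings are computed; the variable names are illustrative only.

```python
# Figures quoted in the post above.
MEMORY_GB = 46.89        # reported memory usage
COLUMNSTORE_GB = 36.45   # reported columnstore disk usage
DAILY_PER_LEAF = 800.0   # reported as "800Gb" per leaf, per day

# Combined size of the replicated database.
database_gb = MEMORY_GB + COLUMNSTORE_GB

# Assumption 1: "Gb" was meant as gigabytes.
ratio_if_gigabytes = DAILY_PER_LEAF / database_gb

# Assumption 2: "Gb" was meant as gigabits (8 bits per byte).
ratio_if_gigabits = (DAILY_PER_LEAF / 8) / database_gb

print(f"Database size: {database_gb:.2f} GB")
print(f"Per-leaf daily traffic vs. database size (gigabytes): {ratio_if_gigabytes:.1f}x")
print(f"Per-leaf daily traffic vs. database size (gigabits):  {ratio_if_gigabits:.1f}x")
```

Under either reading, each leaf is sending at least the full size of the database every day, which is far more than a delta-only stream would suggest.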

My Question:
We were under the impression that replication, once up and running, would only replicate the deltas / changes in data, which should be a fraction of the bandwidth usage we were seeing. What is being replicated during the replication process that would cause this bandwidth usage?

Kind Regards
Donovan

Hi Donovan,

Thanks for using MemSQL. Which release of MemSQL have you installed? Are you ingesting or updating a lot of records to your DB?

Thanks,

Yu-wang

Hi,

That is indeed quite strange. Replication only sends the changes and does not re-replicate the entire database, so the traffic is highly dependent on your workload. In particular, if you’re running load data / pipelines in the background, those also get replicated.

There are some slight exceptions - columnstore blobs are immutable on disk, so occasionally a blob might get merged with another, causing the entire content of the blobs to replicate. This can happen if many rows are deleted from columnstore. Even taking that into account, though, we would expect the traffic to be proportional to the workload on the primary cluster.

Another exception is if you’re seeing a lot of reprovisions - these can happen if the DR cluster falls so far behind that the primary no longer has the deltas the DR cluster needs next (this is per-partition), and in that case it replicates the database from scratch. You should be able to see if this is happening in the tracelogs.
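
One way to look for this is to scan the tracelogs for lines that mention reprovisioning. The following is a minimal sketch only: the search keyword "reprovision" is an assumption about how the event is worded in the log, and the helper names are hypothetical, so adjust both to match what your tracelogs actually contain.

```python
from typing import Iterable, List


def find_matching_lines(lines: Iterable[str], keyword: str = "reprovision") -> List[str]:
    """Return the lines that contain `keyword`, ignoring case.

    The default keyword is an assumption, not a confirmed log message.
    """
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_tracelog(path: str, keyword: str = "reprovision") -> List[str]:
    """Read a log file and return the lines that mention `keyword`."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, "r", errors="replace") as handle:
        return find_matching_lines(handle, keyword)
```

Calling `scan_tracelog` with the path to a tracelog on a DR leaf returns the matching lines. Many matches spread across the week would point towards repeated reprovisions rather than normal delta traffic.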

I have a question regarding your reports - you mention "The reports are quite clear that it was traffic on 3306" - can you provide more detail on this? Do the reports indicate that you are downloading from port 3306 on your primary, or downloading from port 3306 on your DR cluster? I ask because the DR cluster connects to the primary, and thus I wouldn’t expect to see use of port 3306 on the DR cluster, only on the primary.

Good Morning,

Thank you for the replies.
The version we are running is 6.8.3. We are looking to upgrade to V7, but some of the changes made in V7 are breaking changes for us at the moment.

"I have a question regarding your reports - you mention The reports are quite clear that it was traffic on 3306 - can you provide more detail on this"
From the reports we can tell that the Primary cluster (leafs) is uploading the 800Gb per leaf, per day, into our AWS DR cluster, and the download of the same amount is happening on the leafs on our DR site. The aggregated amounts uploaded and downloaded come to the same, give or take a couple of Mb.

As for the reprovisioning on the Primary cluster, there wasn’t any that we could observe at the time. We had to kill the replication due to the AWS costs involved in replicating 1.8 Tb, so there is no way to confirm that now.

Hi Donovan,

Unfortunately, without more information, we can’t really determine where that traffic is coming from.
This is not expected if your workload isn’t rewriting the entire cluster every day - your initial understanding, that the traffic should be proportional to the workload, is correct.

The reprovisioning would happen in the secondary cluster; if you still have the logs for that one, they could be helpful in determining whether it happened. Any extra information you might remember from your reports could help - if you have monitoring, you could also set up a secondary cluster that is not on AWS to test this.

Thank you for the feedback.
Unfortunately, that is all the information we have: each leaf’s total upload bandwidth was within a few Mb of the total download bandwidth on the AWS cluster, and the traffic was on port 3306.