Specify batch_size for pipelines

geet · February 25, 2021, 12:12am

MemSQL pipelines offer a few settings at the moment: BATCH_INTERVAL, MAX_PARTITIONS_PER_BATCH, etc.

I would like to request a BATCH_SIZE option as well.

hanson · February 26, 2021, 8:17pm

Got it. Can you elaborate on what your scenario is and why you want to change BATCH_SIZE, what you want to set it to, and so forth?

I did open an internal feature request to track this.

geet · February 26, 2021, 8:47pm

Thank you.

In most of our cases, there is a steady flow of messages (usually <100 per batch) for the pipelines to deal with but there are times when there is a flood of 200,000+ messages. Not going to bore you with the details of why that happens… it is a business requirement.

When there is such a flood, we noticed that the cluster runs out of memory, the batch fails, and the cluster restarts (probably because it crashed?). I now have to figure out custom solutions to deal with these messages. Not to mention, this is a disruption to its availability.

Yes, we have doubled the memory… and still faced issues. Regardless, for an event that happens only once in a while, it is not economical to keep all that memory sitting unused.

If we are able to set MAX_BATCH_SIZE (or BATCH_SIZE), I am hoping it will cover both our typical and atypical scenarios.
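
To make the request concrete, here is a minimal Python sketch of the behavior a batch-size cap would give. It is only an illustration of the idea, not how pipelines work internally: `capped_batches`, `max_batch_size`, and the cap of 10,000 are hypothetical names and values chosen for this example, and MAX_BATCH_SIZE is the option being requested, not an existing pipeline setting.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def capped_batches(messages: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches holding at most max_batch_size messages each.

    Hypothetical helper: it only models the requested cap and does not
    reflect any real pipeline API.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    batch: List[T] = []
    for message in messages:
        batch.append(message)
        if len(batch) == max_batch_size:
            # The cap is reached, so hand off this batch and start a new one.
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


# Typical load: a small trickle of messages stays in a single batch.
typical = list(capped_batches(range(80), max_batch_size=10_000))

# Flood: 200,000 messages are split into bounded batches instead of one huge one.
flood = list(capped_batches(range(200_000), max_batch_size=10_000))
```

With a cap like this, the typical case is unaffected, while a flood is processed as a series of bounded batches, so peak memory per batch would depend on the cap rather than on the size of the backlog.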

hanson · February 26, 2021, 11:27pm

Got it. I’ll make a note of this on our internal feature request.

hanson · February 26, 2021, 11:31pm

So how are these 200,000+ messages represented during the “flood?” Are they individual records in a single text file sent into the pipeline?

geet · February 27, 2021, 12:14am

They are individual messages in a Kafka topic.

hanson · March 3, 2021, 2:17am

Are you using pipelines to SPs (stored procedures) or regular pipelines when you get OOM during the “flood?”

geet · March 9, 2021, 8:17pm

Due to the lack of support for transformations on Helios, we had to resort to SPs.