SingleStore stream out to only one Kafka Partition

niro · January 18, 2022, 8:43am

Hi,
I know there is a way to stream data out of MemSQL to a Kafka topic using select ... into kafka, but I was wondering how the producer side is implemented.
I am trying to stream data out of SingleStore with the following setup:

1 master aggregator
1 leaf in my cluster
5 partitions in my database where my table stored
5 partitions in my Kafka topic
The data in the table is stored across all the SingleStore partitions

I found out that all the rows from my query are sent to one Kafka partition.
Is there a way to distribute the result rows among all the Kafka topic partitions?

Thanks.

yznovyak-ua · January 20, 2022, 7:11pm

Hi @niro!

We do produce messages into a random partition. However, there are some things done to ensure high throughput. We basically use the recommended settings for this, which are:

The default settings, batch.num.messages=10000 and queue.buffering.max.ms=5, are suitable for high throughput. This allows librdkafka to wait up to 5 ms for up to 10000 messages to accumulate in the local queue before sending the accumulated messages to the broker.

So if you happened to test select * from t into kafka with fewer than 10k records, it is entirely possible that you will see a skew due to this effect. But if you test with ~1M records, this discrepancy should go away.
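
To make the batching effect concrete, here is a toy simulation in Python. It is only a sketch under a simplifying assumption: each batch of up to 10000 messages lands on a single randomly chosen partition. That assumption, the function name `simulate`, and its parameters are illustrative and are not taken from librdkafka or SingleStore internals.

```python
import random
from collections import Counter


def simulate(num_rows, num_partitions=5, batch_size=10_000, seed=0):
    """Toy model: every batch goes, as a whole, to one random partition.

    Assumption for illustration only; the real producer may behave differently.
    Returns a list with the number of rows that landed on each partition.
    """
    rng = random.Random(seed)
    counts = Counter()
    remaining = num_rows
    while remaining > 0:
        batch = min(batch_size, remaining)
        # The whole batch is assigned to a single partition.
        counts[rng.randrange(num_partitions)] += batch
        remaining -= batch
    return [counts[p] for p in range(num_partitions)]


small = simulate(5_000)       # fits in a single batch
large = simulate(1_000_000)   # 100 batches

print("5k rows:", small)
print("1M rows:", large)
```

In this model a result set smaller than one batch always ends up on exactly one partition, while a large result set is split into many batches and spreads out, which matches the skew described above.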

If this is not what you’re seeing, then please let us know and we can look deeper into this. In particular, it would be useful to know which version of SingleStore you are running, what exactly your SELECT looks like, which version of Kafka you are using, and whether there are any non-default server/topic configs that might affect this.

niro · January 23, 2022, 4:42pm

Hi yznovyak-ua
Thanks for the answer. Once I streamed out more rows, the results were distributed among the rest of the partitions.

Is there a way to change or update those default settings, maybe from the SingleStore config file?

Thanks