Kafka Pipeline Performance Test

Hi all,

I tested how the performance of a Kafka pipeline changes with the number of Kafka topic partitions.

Test data size: about 5.5 GB (30,000,000 rows)
Cluster: 1 MA + 4 LF (one master aggregator and four leaves) on 1 host, using one device
Database partitions: 32
Target: columnstore table (columnstore index, shard key = timestamp column)

The test results are below.

Kafka topic partitions  MAX_PARTITIONS_PER_BATCH  pipeline_name   batch_id  batch_time  rows_streamed
1                       30                        pp_kafka_load   481200    419.967629  30,000,000
10                      30                        pp_kafka_load2  501461    427.524445  29,959,137
10                      30                        pp_kafka_load2  501905    0.633521    40,863
30                      30                        pp_kafka_load3  491164    416.204006  29,891,425
30                      30                        pp_kafka_load3  491876    1.716615    108,575
32                      32                        pp_kafka_load4  498693    416.494033  30,000,000
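
To compare the runs on equal terms, it may help to add up the batches of each pipeline and convert them to throughput. The following is only a rough sketch in Python: the numbers are copied from the results above, batch_time is assumed to be in seconds, and the names BATCHES, summarize and SUMMARY are made up for this illustration.

```python
# Batch results copied from the test results above:
# (Kafka topic partitions, pipeline_name, batch_time, rows_streamed)
# Assumption: batch_time is measured in seconds.
BATCHES = [
    (1, "pp_kafka_load", 419.967629, 30_000_000),
    (10, "pp_kafka_load2", 427.524445, 29_959_137),
    (10, "pp_kafka_load2", 0.633521, 40_863),
    (30, "pp_kafka_load3", 416.204006, 29_891_425),
    (30, "pp_kafka_load3", 1.716615, 108_575),
    (32, "pp_kafka_load4", 416.494033, 30_000_000),
]


def summarize(batches):
    """Sum time and rows over all batches of each pipeline and derive rows/sec."""
    totals = {}
    for kafka_partitions, name, seconds, rows in batches:
        entry = totals.setdefault(
            name,
            {"kafka_partitions": kafka_partitions, "seconds": 0.0, "rows": 0, "batches": 0},
        )
        entry["seconds"] += seconds
        entry["rows"] += rows
        entry["batches"] += 1
    for entry in totals.values():
        entry["rows_per_sec"] = entry["rows"] / entry["seconds"]
    return totals


SUMMARY = summarize(BATCHES)

for name, s in SUMMARY.items():
    print(
        f"{name}: kafka_partitions={s['kafka_partitions']} batches={s['batches']} "
        f"rows={s['rows']} seconds={s['seconds']:.3f} rows_per_sec={s['rows_per_sec']:.0f}"
    )
```

Summed this way, every pipeline loads exactly 30,000,000 rows, and all four runs land at roughly 70,000 to 72,000 rows per second, so the differences between them are within about 3%.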

Does the number of partitions in the Kafka topic have no effect on pipeline performance?
And why did pp_kafka_load2 and pp_kafka_load3 each run two batches?

I also tested a pipeline, pp_kafka_load5, under the same conditions as pp_kafka_load4 but loading into a table without a shard key, and confirmed that the shard key affects the pipeline's performance.

Are there any other tuning points that would improve the pipeline's performance?

Thanks,
MK

How many SingleStore partitions do you have? In general, it’s best to have as many Kafka partitions as SingleStore partitions, and after that point performance isn’t likely to improve that much with more Kafka partitions.
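
One way to picture this rule of thumb is as a cap on how many database partitions can be reading at once. The sketch below is a deliberately simplified model, not SingleStore's actual scheduler: it assumes each Kafka partition is read by at most one database partition per batch, and that MAX_PARTITIONS_PER_BATCH limits how many partitions take part in a batch. The function name parallel_readers is made up for this illustration.

```python
def parallel_readers(kafka_partitions, db_partitions, max_partitions_per_batch=None):
    """Upper bound on database partitions reading in one batch (simplified model).

    Assumption: one Kafka partition feeds at most one database partition per batch,
    so extra database partitions sit idle and extra Kafka partitions must wait.
    """
    readers = min(kafka_partitions, db_partitions)
    if max_partitions_per_batch is not None:
        readers = min(readers, max_partitions_per_batch)
    return readers


# (Kafka topic partitions, MAX_PARTITIONS_PER_BATCH) for the four runs in this thread
RUNS = [(1, 30), (10, 30), (30, 30), (32, 32)]
DB_PARTITIONS = 32

READERS = [parallel_readers(k, DB_PARTITIONS, cap) for k, cap in RUNS]

for (k, cap), readers in zip(RUNS, READERS):
    print(f"kafka_partitions={k} max_partitions_per_batch={cap} parallel_readers={readers}")
```

Under this model the four runs would use 1, 10, 30 and 32 readers. Since the measured load times are nearly identical anyway, reading from Kafka may not have been the limiting step in this particular test.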

Hi, Jhuang.

The SingleStore had 32 partitions.

I understand that loading is fastest when the number of Kafka topic partitions equals the number of SingleStore partitions.

However, when the number of Kafka topic partitions was set to 10 or 20, the load was actually slower than with 1 partition, and it ran as two batches.

Could you explain why that happens?