Using SingleStore with Apache Pulsar

Hi,

We are considering using Pulsar as a message broker.
What’s the most efficient way to insert data into SingleStore from Pulsar?

  1. Do you have plans to develop a Pulsar or Flink sink connector?
  2. How hard would it be for us to develop a sink connector that inserts data directly into the leaves?
  3. If we use the JDBC sink connector, what will the performance difference be vs. the Kafka connector?

Thanks!

Welcome to the SingleStore community!

What kind of scale requirements do you have, like how many rows per second do you want to move into SingleStore from Pulsar?

We don’t have a plan to develop a Pulsar or Flink sink connector.

You can add multiple aggregators if you need to handle high scale, and split work across those aggregators (with app logic or a load balancer). Going through the aggregators is the standard way to program against SingleStore. Going directly to the leaves is not something that regular client application software should do.
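To make the "split work across aggregators" idea concrete, here is a minimal sketch of app-side round-robin over aggregator endpoints (the host names are made up; a TCP load balancer such as HAProxy can play the same role without any app logic):

```python
import itertools

# Hypothetical child-aggregator endpoints; replace with your own.
AGGREGATORS = ["agg1:3306", "agg2:3306", "agg3:3306"]

# Endless round-robin iterator over the aggregator list.
_rr = itertools.cycle(AGGREGATORS)

def next_aggregator():
    """Pick the aggregator to send the next batch of inserts to."""
    return next(_rr)

# Successive batches land on different aggregators in turn.
targets = [next_aggregator() for _ in range(4)]
```

Each worker or batch just calls `next_aggregator()` before opening (or reusing) a connection, so ingest load spreads evenly.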

I don’t have an answer for your 3rd question.

Welcome @sharons!

I would certainly start with the built-in JDBC connector and see how it does. For maximum performance, point the sink at a load balancer over all of your child aggregators (this lets you horizontally scale ingest) and experiment with different batch sizes. I bet you will be pleasantly surprised by the out-of-the-box performance.
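For reference, a Pulsar JDBC sink config pointed at such a load balancer might look roughly like the following (the key names are from memory and the host/table names are placeholders; check the Pulsar IO JDBC sink docs for the exact option names):

```yaml
configs:
  jdbcUrl: "jdbc:mariadb://agg-lb.internal:3306/mydb"  # load balancer over the child aggregators
  userName: "sink_user"
  password: "sink_pass"
  tableName: "events"
  batchSize: 512    # rows per batch insert; worth experimenting with
  timeoutMs: 500    # flush a partial batch after this long
```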

If you want to go further, check out our official Kafka Confluent Connect connector for an example of how to squeeze out every drop of ingest performance. One of the key optimizations is switching from batch inserts to LOAD DATA, along with compressing the input stream.
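Roughly, the trick is to serialize each batch as TSV, gzip it in memory, and feed that stream to a LOAD DATA LOCAL INFILE statement instead of running a multi-row INSERT. A sketch in Python (the connector itself is Java; the table/column names are made up, and the driver hook named in the comment is from mysql-connector-j):

```python
import gzip
import io

def batch_to_gzip_tsv(rows):
    """Serialize a batch of tuples as TSV and gzip it in memory."""
    tsv = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(tsv.encode("utf-8"))
    return buf.getvalue()

rows = [(1, "a"), (2, "b")]
payload = batch_to_gzip_tsv(rows)

# The statement the client would run; the driver is then pointed at the
# in-memory stream instead of a real file (e.g. setLocalInfileInputStream
# in mysql-connector-j, which is what the linked Kafka connector does).
sql = ("LOAD DATA LOCAL INFILE 'batch.gz' COMPRESSION GZIP "
       "INTO TABLE events (id, val)")
```

The compression matters because the batch crosses the network from the sink to the aggregator; gzip on a text TSV stream typically shrinks it substantially.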

Hope that helps!

@carl @hanson Thanks for the quick reply!

@carl - In your last paragraph, did you mean that we should implement our own connector (one that reads from Pulsar) instead of using the JDBC one, “mimicking” the existing Kafka connector code and using LOAD DATA, and, in addition, insert directly into the leaves?

Also, regarding “how many rows/s we intend to insert”: for now it will be a few tens of thousands of inserts per second, a total of 20-30k inserts spread across 6 big fact tables, each already containing 10M-50M rows. Half of the inserts will be “INSERT … ON DUPLICATE KEY UPDATE”.
I guess a relatively small SingleStore cluster should meet this load?

We are comparing the performance/$ ratio with RDS and Aurora… so should I assume that, since SingleStore’s insert performance should be much better, we will need less hardware?

It’s an option, although I would start with testing the JDBC sink rather than building your own.
Also - don’t insert into the leaf nodes directly; insert into the child aggregators, and use a load balancer to spread the inserts over all of your aggs.

Yes - that load is very small for SingleStore, assuming you set up the schemas right. I don’t think you need to build a custom connector; the normal JDBC connector should work just fine.

That is a reasonable assumption - but I am very excited to hear the results. I highly recommend jumping on a call with one of our engineers, Sarung Tripathi (stripathi@singlestore.com), to discuss your plans if you want to absolutely maximize your cost efficiency. :slight_smile:

@carl

  1. What is the performance difference (insert rate) between the JDBC connector or the Kafka sink connector (both of which go through the aggregators) vs. a SingleStore Kafka pipeline, which can insert directly into the leaves?
    Also, how many aggregators would we need to add in order to compensate for this?

  2. Just to confirm I understood your comment above about “…switch from batch inserts to using LOAD DATA along with compressing the input stream”: you are suggesting that, if we want additional optimization, we can adapt the JDBC connector source code to be SingleStore-specific and use LOAD DATA? Also, isn’t LOAD DATA mainly for reading data from files, or can it also read from a stream?

  3. Does just using bulk inserts (from a JDBC connector) also run them in parallel?

Hey @sharons, great questions!

  1. This has not been explicitly measured, since it heavily depends on the cluster configuration. As a start, please test the built-in JDBC sink to see if it meets your performance metrics. There are many tweaks we can make to the setup to reduce the load on the aggregators, which would allow you to run this solution without excessive child aggregators. Metrics you should monitor include: CPU/network load on the aggs, network load between the aggs and leaves, and CPU load on the leaves. You can also play with different SHARD keys to improve performance.

  2. Yup, that’s right. And yes, when you use LOAD DATA from a MySQL client library, it usually just takes a file descriptor (or stream) to read from. We set up the streams and configure the raw JDBC connection like so in our Kafka connector: https://github.com/memsql/memsql-kafka-connector/blob/master/src/main/java/com/memsql/kafka/sink/MemSQLDbWriter.java#L58

  3. I am not entirely sure what you mean. A batch insert is a query of the form insert into... values (row 1), (row 2), .... You can open multiple connections to SingleStore and run a separate batch insert on each connection if you want to run things in parallel. The normal JDBC connectors have no concept of parallelism by default; it has to be added in the wrapping application, for example in the Pulsar sink connector (which probably runs batches in parallel automatically).
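The "separate batch insert per connection" pattern above can be sketched as follows (a hedged illustration: `insert_batch` just builds and returns the SQL here; in a real sink each worker would hold its own connection and execute it):

```python
from concurrent.futures import ThreadPoolExecutor

def build_batch_insert(table, rows):
    """Build a multi-row statement: INSERT INTO t VALUES (...), (...)."""
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} VALUES {values}"

def insert_batch(rows):
    sql = build_batch_insert("events", rows)
    # In a real application: open/reuse a connection to the load
    # balancer and run cursor.execute(sql) here.
    return sql

batches = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]

# One worker per connection; each worker runs its own batch insert.
with ThreadPoolExecutor(max_workers=2) as pool:
    stmts = list(pool.map(insert_batch, batches))
```

The parallelism lives entirely in the wrapping application (the thread pool); the driver itself just runs one statement per connection at a time.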


@carl re #3: the LOAD DATA documentation says it maximizes performance by doing inserts in parallel, so I asked whether the DB also parallelizes batch inserts.

Btw, do you happen to have some customizable stress-test tool that you provide? :slight_smile:

Ah, I see. Yes, when you run a batch insert, the individual tuples are streamed into the partitions in parallel.

You can try using https://github.com/memsql/dbbench