Pipelines, one of SingleStore’s key features, ingest different kinds of data from sources like Kafka, S3, GCS and more into SingleStore tables.

Customers also use pipelines extensively for ETL, using stored procedures to transform the data from the source before loading it into tables.
SingleStore provides tuning options at both the global and per-pipeline level for fine-tuning ingestion performance, depending on your data source, data types and data arrival intervals.
```sql
-- Reading any engine variable value
SELECT @@pipelines_max_concurrent;

-- Setting any global variable
SET GLOBAL pipelines_max_concurrent = 20;
```
In this blog, we’ll dive deeper into some pipeline-related engine variables.
- `advanced_hdfs_pipelines`: turn this ON to enable advanced Hadoop features like Kerberos authentication.
- `enable_eks_irsa`: turn this ON to use EKS IAM Roles for Service Accounts (IRSA) for managing credentials.
- `subprocess_ec2_metadata_timeout_ms`: increase this value to allow extra time for fetching credentials from EC2 instances.
- `pipelines_stored_proc_exactly_once`: this should be ON to get exactly-once delivery guarantees.
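For example, enabling a couple of these is a one-line change each (a minimal sketch; which variables are available can depend on your SingleStore version and deployment):

```sql
-- Enable advanced Hadoop features (e.g. Kerberos) for HDFS pipelines
SET GLOBAL advanced_hdfs_pipelines = ON;

-- Enable exactly-once delivery guarantees for stored procedure pipelines
SET GLOBAL pipelines_stored_proc_exactly_once = ON;
```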
A full list of engine variables and more details can be found in our SingleStore Docs.
Here are some additional globals and their corresponding per-pipeline configuration options:

| Global variable (default) | Per-pipeline config |
| --- | --- |
| `pipelines_batch_interval` (2500 ms) | `BATCH_INTERVAL <milliseconds>` |
| `pipelines_stop_on_error` (ON) | `STOP_ON_ERROR { ON \| OFF }` |
| `pipelines_max_offsets_per_batch_partition` (1000000) | `MAX_OFFSETS_PER_BATCH_PARTITION <num_offsets>` |
| `pipelines_max_concurrent_batch_partitions` | `MAX_PARTITIONS_PER_BATCH <num_partitions>` |
| `pipelines_max_retries_per_batch_partition` (4) | `MAX_RETRIES_PER_BATCH_PARTITION <max_retries>`; in 9.0 and later, `RETRY_OPTIONS '{ "exponential":<>, "max_retry_interval":<>, "max_retries":<>, "retry_interval":<> }'` |
A per-pipeline value takes precedence over the global value for that particular pipeline. Per-pipeline values can be specified during pipeline creation, or at a later point using the ALTER PIPELINE command.
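Here’s what a per-pipeline override can look like. This is a minimal sketch: the broker address, topic, pipeline name (`orders_pipeline`) and table name (`orders`) are hypothetical placeholders.

```sql
-- Create a Kafka pipeline that overrides two global defaults for itself only
CREATE PIPELINE orders_pipeline AS
  LOAD DATA KAFKA 'kafka-broker:9092/orders'
  BATCH_INTERVAL 1000                      -- overrides pipelines_batch_interval
  MAX_OFFSETS_PER_BATCH_PARTITION 500000   -- overrides pipelines_max_offsets_per_batch_partition
  INTO TABLE orders;

-- Adjust a setting later without recreating the pipeline
ALTER PIPELINE orders_pipeline SET BATCH_INTERVAL 2500;
```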
Ingestion tuning using some engine variables:
MAX_PARTITIONS_PER_BATCH
This determines the maximum parallelism we can achieve from the source.
It defaults to the number of SingleStore database partitions.
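To cap or raise that parallelism for a single pipeline, set it with ALTER PIPELINE (a sketch, reusing the hypothetical `orders_pipeline` from above):

```sql
-- Let up to 16 batch partitions run in parallel for this pipeline
ALTER PIPELINE orders_pipeline SET MAX_PARTITIONS_PER_BATCH 16;
```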
pipelines_max_offsets_per_batch_partition
This controls the size of the batches in a Kafka pipeline. Smaller values of `pipelines_max_offsets_per_batch_partition` can be more stable and commit more often. Very large values on a pipeline with a small number of Kafka partitions can incur skew if we use keyless sharding, especially coupled with a large batch interval. On the other hand, batches that are too small have to overcome per-batch overhead, so make sure the value is big enough to amortize that cost.
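For example, to commit smaller, more frequent batches cluster-wide (a sketch; the right value depends on your message sizes and arrival rate):

```sql
-- Cap each batch at 500,000 offsets per Kafka partition
SET GLOBAL pipelines_max_offsets_per_batch_partition = 500000;
```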
Let’s break down some real-world scenarios:
If you have 16 Kafka partitions and created your SingleStore database with only eight partitions, at any particular time a batch can pull data from only eight Kafka partitions. Ingesting the data from all 16 Kafka partitions therefore requires two batches, whereas having 16 database partitions completes it in one batch.
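When you control the database, you can match its partition count to the source at creation time (a sketch; `mydb` is a placeholder name):

```sql
-- Create the database with 16 partitions to match the 16 Kafka partitions,
-- so one batch can pull from every Kafka partition at once
CREATE DATABASE mydb PARTITIONS 16;
```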
If each Kafka partition has 1,000 offsets, limiting `pipelines_max_offsets_per_batch_partition` to 500 will double the number of batches, as well as the time to ingest the same data. But we must also keep in mind not to set it too high, to avoid data skew.
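To see the effect of a change like this, watch batch counts and timings in `information_schema.PIPELINES_BATCHES_SUMMARY` (a sketch; `orders_pipeline` is the hypothetical pipeline from earlier):

```sql
-- After lowering the offsets cap, expect more (and smaller) batches here
SELECT *
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = 'orders_pipeline';
```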
To summarize, there is a lot of scope for fine-tuning ingestion performance depending on your data requirements, data source, data arrival intervals and more, all of which should be considered as early as database creation time.
Ready to try Pipelines? Start free with SingleStore.