Pipeline Tuning with Engine Variables

Pipelines, one of the most important features in SingleStore, are used to ingest many kinds of data from sources like Kafka, S3, GCS and more into SingleStore tables.

Customers also use pipelines extensively for ETL, calling stored procedures to transform data from the source before loading it into tables.

SingleStore provides tuning options at both the global and per-pipeline levels for fine-tuning ingestion performance depending on your data source, data types and data arrival intervals.

Reading any engine variable value:

SELECT @@pipelines_max_concurrent;

Setting any global variable:

SET GLOBAL pipelines_max_concurrent = 20;

In this blog, we’ll dive deeper into some pipeline-related engine variables.

  • advanced_hdfs_pipelines. Turn this ON to enable advanced Hadoop features like Kerberos authentication.
  • enable_eks_irsa. Turn this ON to use EKS IAM Roles for Service Accounts (IRSA) for managing credentials.
  • subprocess_ec2_metadata_timeout_ms. Increase this value to allow extra time for fetching credentials from EC2 instance metadata.
  • pipelines_stored_proc_exactly_once. This variable should be ON to get exactly-once delivery guarantees.
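Each of these can be set with SET GLOBAL, as in the earlier example. The values below are illustrative, not recommendations:

SET GLOBAL advanced_hdfs_pipelines = ON;
SET GLOBAL enable_eks_irsa = ON;
SET GLOBAL subprocess_ec2_metadata_timeout_ms = 10000;  -- illustrative timeout in milliseconds
SET GLOBAL pipelines_stored_proc_exactly_once = ON;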

A full list of engine variables and more details can be found in our SingleStore Docs.

Here are some additional globals (with their defaults) and the corresponding per-pipeline configuration options.

Global (default)                                      Per-pipeline config
pipelines_batch_interval (2500 ms)                    BATCH_INTERVAL <milliseconds>
pipelines_stop_on_error (ON)                          STOP_ON_ERROR { ON | OFF }
pipelines_max_offsets_per_batch_partition (1000000)   MAX_OFFSETS_PER_BATCH_PARTITION <num_offsets>
pipelines_max_concurrent_batch_partitions             MAX_PARTITIONS_PER_BATCH <num_partitions>
pipelines_max_retries_per_batch_partition (4)         MAX_RETRIES_PER_BATCH_PARTITION <max_retries>

Starting in version 9.0, retry behavior can also be configured per pipeline with RETRY_OPTIONS '{ "exponential":<>, "max_retry_interval":<>, "max_retries":<>, "retry_interval":<> }'.

The per-pipeline value takes precedence over the global value for that particular pipeline. Per-pipeline values can be specified during pipeline creation, or later using the ALTER PIPELINE command, as shown below.
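Here’s a minimal sketch of both approaches; the pipeline name, Kafka endpoint and table are placeholders:

CREATE PIPELINE kafka_events AS
LOAD DATA KAFKA 'kafka-broker:9092/events'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 8
STOP_ON_ERROR ON
INTO TABLE events;

-- Later, override a single setting for this pipeline only
ALTER PIPELINE kafka_events SET BATCH_INTERVAL 1000;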

Ingestion tuning using some engine variables:

  • MAX_PARTITIONS_PER_BATCH

    • Determines the maximum parallelism we can achieve from the source.

    • Defaults to the number of SingleStore database partitions.

  • pipelines_max_offsets_per_batch_partition

    • Controls the size of the batches in a Kafka pipeline.

    • Smaller values can be more stable and commit more often.

    • Very large values on a pipeline with a small number of Kafka partitions can incur skew if the table uses keyless sharding, especially coupled with a large batch interval.

    • Batches that are too small have to overcome per-batch overhead, so make sure this value is big enough to amortize it (see the sketch after this list).
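A hedged sketch of adjusting both knobs; the pipeline name and values are illustrative, not recommendations. After a change, recent batch sizes and timings can be inspected through information_schema.PIPELINES_BATCHES_SUMMARY:

SET GLOBAL pipelines_max_offsets_per_batch_partition = 100000;

ALTER PIPELINE kafka_events SET MAX_PARTITIONS_PER_BATCH 16;

-- Check batch timing and row counts to validate the change
SELECT * FROM information_schema.PIPELINES_BATCHES_SUMMARY;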

Let’s break down some real-world scenarios:

  • If you have 16 Kafka partitions but created your SingleStore database with only eight partitions, a batch can pull data from only eight Kafka partitions at a time. Ingesting the data from all 16 Kafka partitions therefore requires two batches, whereas having 16 database partitions would complete it in one batch.

  • If each Kafka partition has 1,000 offsets, limiting pipelines_max_offsets_per_batch_partition to 500 will increase the number of batches, as well as the time to ingest the same data. But we must also keep in mind not to set it too high, to avoid data skew.
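The arithmetic behind both scenarios can be checked directly; the numbers mirror the examples above:

-- 16 Kafka partitions read by at most 8 database partitions => 2 batches;
-- 1,000 offsets at 500 offsets per batch partition => 2 rounds per partition
SELECT CEILING(16 / 8) AS batches_for_all_kafka_partitions,
       CEILING(1000 / 500) AS rounds_per_kafka_partition;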

To summarize, there is a lot of scope for fine-tuning ingestion performance depending on your data requirements, data source, data arrival intervals and more, all of which should be considered as early as database creation time.

Ready to try Pipelines? Start free with SingleStore. 
