Running SingleStore on Kubernetes in Production


Databases are now expected to behave like cloud-native services

As organisations standardise on Kubernetes, expectations have shifted. Databases are no longer treated as special cases - they are expected to support declarative deployment, safe upgrades, uniform scaling, and clear operational patterns, just like any other workload, even though they are stateful.

But running a distributed database in production on Kubernetes is fundamentally different from running stateless services. The real questions are not about deployment - they are about operations. How should nodes be sized? Where should data live? How do upgrades work without disruption? What happens when components fail?

In practice, I find that teams are usually confident in getting SingleStore running. The harder part is staying confident after the first upgrade, the first node failure, or the first time a workload grows faster than expected. This article focuses on what it actually takes to reach - and maintain - that confidence.

What this article covers

  • Architecture decisions that shape long-term stability
  • Operational patterns for upgrades, failures, and scaling
  • The metrics and recovery workflows that matter most
  • Common pitfalls and how to avoid them

Production confidence starts with knowing Kubernetes limits

Kubernetes provides powerful primitives for lifecycle management, scheduling, and infrastructure automation. However, it does not solve database-specific concerns such as persistent storage performance, data locality, or recovery behaviour in distributed systems. Those responsibilities still sit with the platform and the team operating it.

SingleStore bridges this gap through its Kubernetes Operator - a Kubernetes-native controller that encodes operational patterns specific to SingleStore’s aggregator/leaf architecture. It handles aggregator and leaf lifecycle, manages failures, and performs rolling upgrades, aligning Kubernetes primitives with the needs of a distributed database and enabling declarative cluster management.

Production confidence comes from understanding this relationship clearly. The Operator simplifies operations significantly, but it does not eliminate the need for informed design decisions.

The Operator in one sentence
The SingleStore Kubernetes Operator encodes operational patterns - lifecycle, upgrades, failure handling - but it does not replace informed decisions about storage, topology, and node sizing. Those are still yours to make.

Day-0 decisions define long-term stability

Production environments are shaped by decisions made before the first query runs.

Cluster topology

SingleStore separates aggregators and leaves, each with a distinct role. Aggregators handle query coordination and concurrency, while leaves are responsible for data storage and execution. These roles are configured declaratively using the cluster specification (sdb-cluster.yaml).
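
As a sketch, a minimal cluster spec might look like the following. Field names follow the MemsqlCluster custom resource as documented for recent Operator releases; the licence, password hash, counts, and sizes are placeholders - verify every field against the cluster spec reference for the Operator version you run.

```yaml
# sdb-cluster.yaml - illustrative topology sketch, not a drop-in manifest.
apiVersion: memsql.com/v1alpha1
kind: MemsqlCluster
metadata:
  name: sdb-cluster
spec:
  license: LICENSE_KEY_HERE            # placeholder
  adminHashedPassword: "HASHED_PW"     # placeholder - hashed, never plaintext
  nodeImage:
    repository: singlestore/node
    tag: "8.5.22"                      # pin an explicit release, never latest
  redundancyLevel: 2                   # HA: every partition keeps a replica
  aggregatorSpec:
    count: 3                           # query coordination and concurrency
    height: 0.5
  leafSpec:
    count: 4                           # data storage and execution
    height: 1                          # 1 height unit ~ 8 vCPU / 32 GB RAM
    storageGB: 512
```

Applying this with kubectl and letting the Operator reconcile it is what makes the topology repeatable: the manifest, not tribal knowledge, defines the cluster.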

Storage

Kubernetes offers multiple storage options, but they vary significantly in performance and behaviour. Leaves require consistent, high-throughput storage to support analytical workloads and recovery operations.
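
For example, on AWS a dedicated StorageClass for leaf volumes can provision gp3 volumes with explicit IOPS and throughput. The provisioner and parameters below assume the EBS CSI driver and are illustrative - substitute your cloud's equivalents and your own performance targets.

```yaml
# StorageClass tuned for leaf volumes (AWS EBS CSI driver assumed).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: singlestore-leaf
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"          # provisioned IOPS above the gp3 baseline
  throughput: "500"     # MiB/s - headroom for recovery and compaction bursts
volumeBindingMode: WaitForFirstConsumer   # bind after scheduling, near the pod
allowVolumeExpansion: true                # grow leaf volumes without recreating
```

The cluster spec can then reference this class for leaf volumes, so storage behaviour is declared alongside topology rather than inherited from a cluster default.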

Node sizing

CPU, memory, and instance selection influence query execution and system stability. Following SingleStore’s system requirements and recommendations ensures consistent performance and avoids resource contention. For analytical workloads, memory is typically the constraint - leaves need headroom not just for query execution, but for Universal Storage’s buffer pool and background compaction tasks.

Topology, storage, and node sizing are tightly connected. When defined clearly at Day-0 and encoded in manifests, they create a foundation that is both repeatable and reliable.

Day-1 and Day-2 operations are where confidence is actually built

A system is not production-ready because it deploys successfully. It is production-ready when it responds reliably under change.

Upgrades

Upgrades are a good example. With the SingleStore Operator, upgrades can be performed as controlled rolling operations - nodes are updated incrementally, allowing the cluster to remain available when properly sized.

Don’t roll untested Operator upgrades in production

I always recommend staging Operator upgrades before running them in production. The Operator release notes document behavioural changes in upgrade semantics, and rolling an untested Operator version in production is an avoidable risk.

Pin your Operator and SingleStore image versions to a defined release - avoid singlestore/node:latest and validate every upgrade in staging before production.
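
In the cluster spec, that pinning is a two-line change (the tag below is illustrative):

```yaml
# Pin the node image to a defined release in sdb-cluster.yaml.
spec:
  nodeImage:
    repository: singlestore/node
    tag: "8.5.22"   # explicit release - never "latest"
```

Pin the Operator's own Deployment image the same way, so a cluster rebuild pulls exactly the versions you validated in staging.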

Failure handling

When a node becomes unavailable, Kubernetes schedules a replacement pod and the Operator triggers rebalancing or recovery. But the speed and impact of recovery, and overall high availability, depend heavily on the storage layer. Restoring a leaf’s data from network storage can take materially longer than reattaching a local PV - which directly affects how long the cluster operates at full capacity.

Plan for different failure scenarios and run them in a staging environment before go-live.

Scaling

SingleStore clusters scale along two dimensions - data capacity through leaves and query concurrency through aggregators. These can be adjusted independently via the Kubernetes configuration, which is genuinely useful when a workload grows in only one dimension.
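
Scaling either dimension is then a spec edit followed by a re-apply (`kubectl apply -f sdb-cluster.yaml`); the counts here are illustrative:

```yaml
# Scale each dimension independently by editing the cluster spec.
spec:
  aggregatorSpec:
    count: 3      # was 2: more query concurrency
  leafSpec:
    count: 6      # was 4: more data capacity - expect a rebalance
```

Adding leaves triggers partition rebalancing, so schedule capacity changes with the same care as an upgrade.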

No documentation replaces running these scenarios yourself. Testing upgrades, simulating failures, and observing scaling behaviour builds real confidence that reading alone cannot.

Observability and recovery are non-negotiable

No production system can operate reliably without visibility - and for SingleStore on Kubernetes, there are specific signals worth monitoring beyond generic infrastructure metrics.

Metrics that matter

On the infrastructure side, CPU utilisation, memory pressure, disk I/O, and network throughput all integrate naturally with Prometheus, Grafana and Datadog. But from the database layer, the metrics that tend to catch problems earliest are:

  • Query concurrency and queue depth on aggregators - reveals whether you’re approaching concurrency limits

  • Disk write throughput and compaction lag on leaves - indicates whether background tasks are keeping up

  • Leaf rebalancing time after any node event
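
Collecting the database-level signals typically means scraping SingleStore's metrics exporter alongside the usual node and cAdvisor metrics. A hedged Prometheus scrape sketch - the pod label and the exporter port (9104) are assumptions to match against your deployment:

```yaml
# Prometheus scrape job for the SingleStore exporter (label/port assumed).
scrape_configs:
  - job_name: singlestore
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods belonging to the SingleStore cluster.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: singlestore-cluster
        action: keep
      # Point scrapes at the exporter port on each pod IP.
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: "$1:9104"
        target_label: __address__
```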

Establish baselines before you need them

Set up monitoring in staging and establish baseline metrics before production traffic arrives. When something looks unusual in production, you need a reference point - and without one, you are guessing.

Backup and recovery

SingleStore supports consistent backup strategies and integrates with cloud storage systems. More importantly: validate your restore workflow, not just your backup.

A backup you’ve never restored is an assumption

Run restore drills regularly. A recovery process you have never executed under pressure is not a safety net - it is a hypothesis.

Common pitfalls are usually predictable


Three pitfalls I see most often

1.  Under-sizing storage relative to memory. Leaves need sufficient disk capacity and write throughput to support Universal Storage and background tasks. I/O saturation surfaces under load in ways that are difficult to diagnose after the fact.

2.  Treating Kubernetes as a complete solution for data locality. Cross-node network traffic still has cost. Align shard keys and cluster topology to minimise cross-node shuffles for common joins and aggregations early - testing with realistic workloads quickly reveals locality issues that synthetic benchmarks miss.

3.  Running untested Operator upgrades in production. Pin your versions, read the release notes, and always stage first.

A practical path to production readiness

Moving from deployment to production does not require a complete redesign. It requires disciplined validation.

Start with a cluster that reflects real workload expectations. Validate storage performance and node sizing under load. Simulate failure scenarios and observe recovery behaviour. Test upgrade workflows in staging. Integrate monitoring and establish baseline metrics before you need them.


Pre-production readiness checklist

✓ Validate storage performance and node sizing under realistic load

✓ Simulate node failure and observe recovery time and cluster behaviour

✓ Test a full upgrade cycle in staging with the exact Operator version you plan to use

✓ Integrate Prometheus/Grafana or Datadog and capture baseline metrics

✓ Run a restore drill from backup before go-live

This process doesn’t need to be complex, but it does need to be intentional. In my experience, teams that skip staging validation are the ones most likely to encounter surprises in production, often ones that were easy to anticipate.

Not every team wants to own the full operational lifecycle

While Kubernetes provides flexibility and control, it also introduces ongoing responsibility - managing upgrades, handling failures, tuning performance, and maintaining reliability over time.

SingleStore Helios offers a different approach: a fully managed experience where lifecycle management and operational complexity are handled as a service. This lets teams focus on data and workloads rather than infrastructure, accelerating time to value while maintaining consistent performance and reliability.

Self-managed vs Helios: how to decide

Choose self-managed on Kubernetes when your team has the operational maturity and wants full control over infrastructure, placement, and upgrade timing.

Choose SingleStore Helios when you need to move quickly, or when the operational overhead of a distributed database would slow your team down.

The bigger picture

Running databases on Kubernetes, including distributed ones, is no longer experimental - it is becoming standard.

What differentiates successful deployments is not the ability to deploy, but the ability to operate consistently over time.

The teams I see succeed are the ones who treat production readiness as a process: they validate before go-live, they monitor from day one, and they build familiarity with failure behaviour before failure happens. 

SingleStore on Kubernetes gives you the architectural foundation. When combined with thoughtful planning and validated processes, it enables teams to run distributed analytical and transactional workloads with confidence.

 

