
Every risk committee eventually asks the same three questions about a critical piece of infrastructure: What can fail? How quickly do we recover? And what data do we lose in the process?
For a cloud database platform, these are not theoretical questions. Infrastructure fails. Availability zones go offline. Entire regions become unavailable, including the underlying cloud storage infrastructure supporting production workloads. And sometimes the most damaging event is not a cloud provider outage at all - it is an accidental deletion by someone on your own team.
Understanding a vendor's cloud disaster recovery architecture - not just their marketing claims about it - is a requirement for any enterprise deploying production workloads. This article walks through the specific failure scenarios and the mechanisms that address each one.
Two Terms You Need to Agree on Before the Conversation Starts
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are business decisions, not technical ones. But the vendor's architecture determines what is achievable. The NIST SP 800-34 Contingency Planning Guide provides authoritative definitions that are worth referencing in your DR planning:
RPO and RTO - plain language
RPO (Recovery Point Objective): How much data can your organization afford to lose, expressed as a time window? An RPO of 10 minutes means you accept losing up to 10 minutes of transactions in a disaster. An RPO of zero means no committed transaction may be lost.
RTO (Recovery Time Objective): How long can your organization operate without the database before the business impact becomes unacceptable? An RTO of 30 minutes means the system must be back online within half an hour of a declared disaster.
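The two definitions can be turned into a simple check against an incident timeline. The following is a minimal sketch, not a vendor tool: the class names, field names, and timestamps are all hypothetical, and it assumes you can identify the newest write that survived the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, as a time window
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class Incident:
    last_recoverable_write: datetime  # newest transaction that survived
    disaster_declared: datetime       # when the disaster was declared
    service_restored: datetime        # when the database was back online


def data_loss_window(incident: Incident) -> timedelta:
    """Time span of transactions that could not be recovered."""
    return incident.disaster_declared - incident.last_recoverable_write


def downtime(incident: Incident) -> timedelta:
    """Time between the declared disaster and restored service."""
    return incident.service_restored - incident.disaster_declared


def evaluate(incident: Incident, objectives: RecoveryObjectives) -> dict:
    """Compare what actually happened against the agreed objectives."""
    return {
        "rpo_met": data_loss_window(incident) <= objectives.rpo,
        "rto_met": downtime(incident) <= objectives.rto,
    }


# Hypothetical objectives and timeline, for illustration only.
objectives = RecoveryObjectives(rpo=timedelta(minutes=10), rto=timedelta(minutes=30))
incident = Incident(
    last_recoverable_write=datetime(2024, 1, 1, 11, 53),
    disaster_declared=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 25),
)
result = evaluate(incident, objectives)
print(result)
```

In this made-up timeline the data loss window is 7 minutes and the downtime is 25 minutes, so both objectives are met. The same check can be run against the results of a DR test to see whether the numbers a vendor quotes hold up in practice.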
Failure Scenario 1 - A Node Goes Down
This is the most common failure type: a single server fails within a region. It is a hardware problem, not an outage, and it should be invisible to users.
SingleStore Helios deploys both the Control Plane and Data Plane across multiple Availability Zones within each region. Node failures cause automatic failover to healthy nodes in other Availability Zones. Clusters continue running. No user action is required.
Failure Scenario 2 - An Availability Zone Goes Offline
This is a step up in severity: an entire data center zone within a region becomes unavailable. Because the deployment is multi-AZ, the cluster is distributed across zones, and when one zone goes offline, the remaining zones carry the load. Recovery is measured in seconds to minutes, not hours.
Failure Scenario 3 - A Region Goes Down
This is the scenario that cloud disaster recovery plans are written for. SingleStore Helios addresses regional outages with a fully managed feature called Smart Disaster Recovery (Smart DR). Here is how it works:
Continuous replication | Smart DR maintains continuous asynchronous replication of your databases from a primary region to a geographically separate secondary region. |
What is replicated | Not just the data. Smart DR preserves workspace topology, user accounts, permissions, pipelines, firewall settings, and configuration metadata in the secondary region. |
Failover process | Fully automated. On failover, Helios provisions compute in the secondary region, attaches the replicated databases, and exposes a new connection string for your applications. |
RPO | As low as approximately 10 minutes. Under normal replication conditions, this is the upper bound on the data loss window. |
Cost model | No need to run hot compute in the DR region during normal operations. Compute is provisioned on demand at failover time, keeping ongoing DR costs low. |
Pre-provisioning | Organizations that want shorter failover times can optionally pre-provision compute in the DR region. Database branching allows DR testing without affecting production. |
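Because the replication described above is asynchronous, the secondary region always trails the primary by some lag, and that lag is what the RPO bounds. The sketch below illustrates the idea with hypothetical timestamps. The function names are illustrative and are not part of any Helios API.

```python
from datetime import datetime, timedelta


def lost_on_failover(commit_times, last_replicated_at):
    """Commits acknowledged by the primary but not yet applied in the secondary."""
    return [t for t in commit_times if t > last_replicated_at]


def replication_lag(primary_failed_at, last_replicated_at):
    """How far the secondary trailed the primary at the moment of failure."""
    return primary_failed_at - last_replicated_at


# Hypothetical values, for illustration only.
rpo_target = timedelta(minutes=10)
primary_failed_at = datetime(2024, 1, 1, 12, 0)
last_replicated_at = datetime(2024, 1, 1, 11, 54)
commit_times = [
    datetime(2024, 1, 1, 11, 50),
    datetime(2024, 1, 1, 11, 54),
    datetime(2024, 1, 1, 11, 57),
    datetime(2024, 1, 1, 11, 59),
]

lost = lost_on_failover(commit_times, last_replicated_at)
lag = replication_lag(primary_failed_at, last_replicated_at)
within_rpo = lag <= rpo_target
print(len(lost), lag, within_rpo)
```

Here two commits made after 11:54 never reached the secondary region, and the 6-minute lag is inside the 10-minute target. When asking a vendor about RPO "under what conditions", the condition that matters is whatever keeps this lag bounded.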
Failure Scenario 4 - Accidental Deletion
Cloud provider outages make headlines. Accidental deletions cause more real-world data loss. A misconfigured script, a dropped table, a backup job that ran against the wrong environment - these are common, damaging, and often not covered by DR plans that focus exclusively on infrastructure failures.
SingleStore Helios runs automatic continuous cloud backups to durable object storage. Backups support two recovery modes depending on the edition in use:
Backup and recovery options by edition
- Standard edition - Restore: Databases can be restored to a previous backup state within the retention window (typically 7 days).
- Enterprise edition - Point-in-Time Recovery (PITR): Databases can be restored to any specific moment within the retention window - not just to a scheduled backup. This narrows the data loss window significantly for accidental deletion scenarios where the exact moment of the problem is known.
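The difference between the two recovery modes shows up in how much good data is discarded along with the bad. The following is a rough sketch under stated assumptions: the backup schedule, the incident time, and the one-second safety margin are hypothetical, and the functions model the concept rather than any actual restore command.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def restore_point_from_backups(backup_times, incident_at):
    """Newest backup taken strictly before the incident.

    backup_times must be sorted in ascending order.
    """
    i = bisect_left(backup_times, incident_at)
    if i == 0:
        raise ValueError("no backup predates the incident")
    return backup_times[i - 1]


def restore_point_pitr(incident_at, retention_start, margin=timedelta(seconds=1)):
    """A moment just before the incident, if it falls inside the retention window."""
    target = incident_at - margin
    if target < retention_start:
        raise ValueError("incident is outside the retention window")
    return target


# Hypothetical schedule and incident, for illustration only.
backup_times = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
incident_at = datetime(2024, 1, 1, 14, 30)  # e.g. the moment a table was dropped
retention_start = incident_at - timedelta(days=7)

from_backup = restore_point_from_backups(backup_times, incident_at)
from_pitr = restore_point_pitr(incident_at, retention_start)

print("lost with backup restore:", incident_at - from_backup)
print("lost with PITR:", incident_at - from_pitr)
```

With these made-up times, restoring from the last backup discards two and a half hours of valid writes, while point-in-time recovery discards about one second. The gain depends on knowing the exact moment of the incident, which is why audit logs matter for this scenario.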
On workspace termination, compute resources are immediately deallocated and any cached data is securely purged. Encrypted database data and backups remain in object storage for the configured retention period. After the retention window expires, data is automatically and securely deleted.
The Control Plane - What Happens to the Portal During a DR Event
The Control Plane - the portal and management APIs - does not store customer data. But it governs operations like autoscaling and workspace management. A DR event that takes down the portal affects your ability to manage the platform, even if your databases continue serving queries.
Control Plane DR characteristics
The Control Plane backing database is replicated asynchronously across regions. Failover of this database typically completes in approximately 3 minutes.
Control Plane compute runs in two Kubernetes clusters in different regions, with a warm secondary site that can be promoted quickly during a failover event.
Target RTO for the Control Plane: 30 minutes. RPO for configuration metadata: 60 minutes.
Neither figure affects the durability of customer data, which is protected separately by Smart DR and backups.
DR is executed through a standard, audited runbook - not an ad-hoc process.
Questions to Ask in Any DR Vendor Conversation
Specific numbers and mechanisms separate genuine DR capability from aspirational documentation:
- What is your RPO for regional failover, and under what conditions?
- What is your RTO for the database layer and the management layer separately?
- Is replication continuous or scheduled?
- What is replicated - only data, or configuration and metadata as well?
- Is failover automated, or does it require manual steps?
- What is the retention window for backups, and is point-in-time recovery available?
- How often do you test DR, and how is that testing documented?
A vendor with a tested DR architecture will have precise, consistent answers to all of these. Vague answers about resilience and redundancy without accompanying figures are a signal worth noting.
Download the Full SingleStore Helios Cloud Security White Paper
The SingleStore Helios Cloud Security White Paper covers the complete security architecture in depth - including platform architecture, network security, identity and access management, cryptography, logging and monitoring, SDLC practices, and incident management.
This Article Is Part of a Series
Enterprise Security with SingleStore Helios - 7 articles exploring every layer of cloud database security. Links will be added as each article publishes.
1 | |
2 | |
3 | |
4 | What Happens When Something Goes Wrong [You are here] |
5 | The Encryption Control Spectrum |
6 | Why Shared Responsibility Isn't a Risk Transfer |
7 | Security Engineering, Not Just Security Features |












