Designing a Real-Time Data Warehouse

14 min read

Sep 4, 2024

In today’s data-driven world, many applications require the ability to run low-latency analytical queries over fresh data.

These real-time applications typically require some combination of real-time reporting, alerting, semantic and full-text contextualization, and automated actions based on live data streams — across relational, JSON, geospatial and vector data from clickstreams, application telemetry, IoT devices, financial markets or logistics.

Real-time analytical workloads combine complex analytical queries with low latency and higher concurrency than typical warehouses (think thousands of users), plus data freshness demands typically found in OLTP or operational systems. This combination is vital not only for real-time analytics, but also for the contextualization required by AI and Retrieval Augmented Generation (RAG) use cases. Traditional data warehouses often fall short in meeting these requirements. Enter the real-time data warehouse (RTDW), an advanced data architecture that supports the continuous ingestion, processing and querying of data with minimal latency.

SingleStore has been at the forefront of real-time data warehousing for over a decade. SingleStore’s distributed SQL engine is built from the ground up for real-time analytics, making data available the moment it is ingested while running ultra-low latency queries at exceptionally high concurrency.

In this blog, we explore why real-time data warehouses are crucial components of modern data architectures and intelligent applications, key principles for designing a real-time data warehouse on any platform and how SingleStore is uniquely positioned to provide an enterprise-ready, real-time data warehouse.

Why you need a purpose-built real-time data warehouse

Imagine a bank designing a fraud detection platform to oversee customer transactions: every second of delay between the fraudulent activity and its detection is a direct risk to the bank with real-world financial implications. These delays are often the result of outdated data architectures that aren’t designed for the speed and immediacy required for a real-time analytics use case.

Unlike a traditional Enterprise Data Warehouse (EDW), which relies on batch processing to manage large volumes of historical data with periodic updates, a real-time data warehouse (RTDW) is built to handle continuous data streams, enabling near-instantaneous querying and concurrent low-latency analytics on large, constantly updated volumes. This real-time capability allows businesses to gain immediate insights and react swiftly to critical events like fraud detection or market fluctuations, where every second counts.

Traditional EDWs often falter when it comes to providing real-time data access and managing dynamic operations like upserts and updates, which are essential in environments driven by continuous transaction streams from databases or live data feeds. While EDWs are well-suited for complex, aggregate queries on historical data, their batch-oriented nature can introduce delays, making them less effective for time-sensitive decisions.

In contrast, an RTDW’s ability to analyze data as it arrives offers a distinct advantage in scenarios demanding speed and agility. With a traditional EDW, massive overprovisioning, autoscaling and hours of manual tuning and configuration are typically required to “melt some of the snow away” and push closer to a near-real time stream from a source system or ETL from a transactional database.

Some developers take the inverse approach, attempting to scale up or scale out the transactional database to handle analytical query complexity on data streams as they’re written. This merely swaps one problem for another, requiring manual intervention and overprovisioning from ingest through to the queries themselves (often analytical in nature, and not optimally run in RDBMS engines), all while accounting for concurrent workloads.

Talented developers have leveraged both approaches to edge closer to near-real time, but at excessive cost in time and money. Real-time data warehouses like SingleStore provide an alternative architecture designed for simultaneous batch and streaming ingestion, optimized to run low-latency, high-concurrency analytical queries at scale while maintaining row-level insert and select performance.

When it comes to real-time data warehousing, SingleStore stands out as a top-tier solution not only for its unmatched speed and incredible scalability, but also its simplicity. Its architecture seamlessly integrates transactional and analytical workloads, handling high-velocity data streams and complex queries with minimal latency.

Design principles for a real-time data warehouse

Regardless of the engine powering the warehouse, any truly real-time data warehouse environment has to account for a few common requirements:

  1. Real-time data ingestion with continuous inserts and upserts
  2. Low-latency processing
  3. Performance optimized for both low latency and high concurrency
  4. Scalability
  5. Ease of integration

As your use case increasingly depends on these critical factors, the limitations of general-purpose warehouses become apparent. This is where a purpose-built solution like SingleStore becomes essential, ensuring your data architecture can handle the demands of real-time applications while seamlessly integrating into your existing systems.

1. Real-time data ingestion: Streaming continuous inserts/upserts

A real-time data warehouse must maintain high-performance streaming pipelines and API-driven insert, update and upsert capabilities that do not impact read performance — and that make data immediately queryable. Real-time analytics cannot exist meaningfully without the ability to handle real-time data ingestion. This means the ability to perform continuous, near-real time inserts or upserts from a combination of sources. Regardless of source, pipelines should be easy to create, tune, monitor and manage at scale.

Real-time pipelines typically run alongside continuous inserts and upserts from application APIs. These writes require the same real-time execution and transactional consistency at high concurrency. A real-time data warehouse engine needs to be optimized to handle aggressive concurrent load requirements to maintain a real-time stream of immediately queryable data, without over-utilizing resources.

Why SingleStore. SingleStore provides highly performant writes across data streams, both as traditional API or connector-driven writes and as native pipelines that subscribe to and continuously ingest Kafka topic messages or files in a bucket in parallel. This allows data to be written to SingleStore almost immediately as it lands in Kafka. Message offsets can be managed and reset, and advanced registry configurations can handle schema evolution for Avro files and table inference for Parquet files. Kafka and SingleStore database partition counts can be matched to maximize parallel streaming directly into SingleStore partitions, where data is available to query right away.
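
As a minimal sketch of what this looks like in practice, assuming a hypothetical clickstream topic, broker address and table schema, a native pipeline takes just a few statements:

```sql
-- Hypothetical schema and topic: names and columns are illustrative.
CREATE TABLE clickstream_events (
    event_id BIGINT,
    user_id  BIGINT,
    page_url VARCHAR(2048),
    event_ts DATETIME(6),
    SORT KEY (event_ts),   -- orders the columnstore for time-range scans
    SHARD KEY (user_id)    -- spreads rows evenly across database partitions
);

-- Subscribe to a Kafka topic; database partitions consume
-- topic partitions in parallel.
CREATE PIPELINE clickstream_pipeline AS
LOAD DATA KAFKA 'kafka-broker:9092/clickstream'
INTO TABLE clickstream_events
FORMAT JSON
(event_id <- event_id,
 user_id  <- user_id,
 page_url <- page_url,
 event_ts <- event_ts);

START PIPELINE clickstream_pipeline;
```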

Unlike most other real-time data warehouse solutions, SingleStore supports updates, deletes, replaces and, importantly, upserts. This should be a hard requirement, especially for event-driven architectures like monitoring. Upserts reduce the need for multiple processing stages to handle inserts and updates separately, simplifying the data pipeline and improving processing efficiency. This can lead to faster query response times, which is critical in real-time analytics where delays can significantly impact decision making. Upserts can be critical in complex data models to maintain data consistency, and enable complex business logic behind the insert.

Many solutions, ClickHouse among them, claim to support upserts, but under the covers this is typically implemented with ReplacingMergeTree tables and eventual consistency — which slows upsert performance and breaks ACID-compliant consistency across tables and clusters. Make sure you select a real-time data warehouse with the flexibility to perform upserts as well as continuous inserts and batched updates.
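
Upserts themselves use SingleStore’s MySQL-compatible syntax. A minimal sketch with a hypothetical device-state table; ON DUPLICATE KEY UPDATE needs a unique key (here the primary key) to detect collisions:

```sql
-- Hypothetical table tracking the latest state per device.
CREATE TABLE device_state (
    device_id BIGINT NOT NULL,
    status    VARCHAR(32),
    reading   DOUBLE,
    last_seen DATETIME(6),
    PRIMARY KEY (device_id)
);

-- Insert a new device row, or update the existing row in place.
INSERT INTO device_state (device_id, status, reading, last_seen)
VALUES (42, 'online', 98.6, NOW(6))
ON DUPLICATE KEY UPDATE
    status    = VALUES(status),
    reading   = VALUES(reading),
    last_seen = VALUES(last_seen);
```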

2. Low-latency processing

A real-time data warehouse must be capable of performing complex transformations, in-line indexing, aggregations and joins in near-real time to ensure data is available for robust analytical queries right as it is written. Real-time ingestion with immediately available data is the hallmark of a real-time data warehouse. Data should be immediately persisted in a high-performance warm storage layer, optimized for low-latency access and rapid query execution.

This approach circumvents the need for on-the-fly hydration from cold object storage, a frequent limitation in traditional EDW architectures that introduces significant I/O overhead and delays, thereby eroding the system's ability to deliver true real-time analytics. Performing transformations or building indexes after the load increases overall time to insight by adding a second step between load and query — keeping you further from real-time nirvana.

Consider an ad-optimization platform that serves live ads to hyper-segmented audiences: this requires a real-time stream of prospect activity that can be near-instantly analyzed to create customer segmentation, serving live ads before a user leaves a page or domain. A 15-second delay between a site visit and segmentation may mean the ad doesn’t get placed in time.

Why SingleStore. SingleStore’s patented Universal Storage combines the aggregate scan performance and compression of a columnstore with a built-in in-memory rowstore sidecar. As data is ingested — by pipeline stream or in batch — it lands first in the in-memory rowstore layer, where it is indexed and immediately available for querying or in-line transformation before being flushed to disk.

Once on disk, the data is compressed and stored in columnar format, then asynchronously written to object storage, providing unlimited capacity. This means data is available in a warm query layer as it streams in, rather than pulled up from object storage onto disk as in traditional warehouse solutions — enabling near-real time data streams that include transformations or aggregations, with data available immediately.
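
Pipelines can also deliver batches to a stored procedure instead of a table, which is one way to perform in-line transformation or aggregation during ingestion. A rough sketch, with hypothetical topic, table and column names:

```sql
-- Hypothetical sketch: raw purchase events are aggregated into running
-- per-user totals as each pipeline batch arrives, instead of in a
-- separate post-load step.
CREATE TABLE user_totals (
    user_id BIGINT NOT NULL,
    total   DOUBLE NOT NULL,
    PRIMARY KEY (user_id)
);

DELIMITER //
CREATE OR REPLACE PROCEDURE aggregate_purchases(
    batch QUERY(user_id BIGINT, amount DOUBLE))
AS
BEGIN
    -- Fold the incoming batch into the running totals with an upsert.
    INSERT INTO user_totals (user_id, total)
    SELECT user_id, SUM(amount) FROM batch GROUP BY user_id
    ON DUPLICATE KEY UPDATE total = total + VALUES(total);
END //
DELIMITER ;

-- The pipeline hands each batch to the procedure rather than a table.
CREATE PIPELINE purchases_pipeline AS
LOAD DATA KAFKA 'kafka-broker:9092/purchases'
INTO PROCEDURE aggregate_purchases
FORMAT JSON
(user_id <- user_id, amount <- amount);

START PIPELINE purchases_pipeline;
```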

3. Performance optimized: Low latency + high concurrency

Real-time data warehouses should be designed with large-scale and complex analytics in mind, handling heavy joins and complex analytical queries in seconds or milliseconds. They should also be optimized to maintain ultra-low latency on complex analytical queries at high concurrency. Imagine a customer-facing sales operations and analytics platform where thousands of end-users perform ad-hoc searches and refresh countless dashboards and lists with real-time expectations.

Even in instances where it’s okay if the data is five minutes old, no product owner wants their end-users spending more than a few seconds waiting for a segmented list — regardless of how many other end-users are running similar requests. It’s a bad user experience at best, fuel for competitors at worst.

Typically, this requires a database with columnar storage and compression, along with clever indexing, partitioning and some form of built-in caching or compilation that uses memory to supplement disk or blob-based seeks. A traditional data warehouse might provide the required performance on a recursive CTE run a few times against a stream of data for a reporting use case, but it will often fall down — or auto-scale to oblivion — when you try to run the same query at decent concurrency, let alone open it up to end users. Hundreds of financial analysts running the same query might see serious delays in response time.

Why SingleStore. SingleStore’s database engine is designed for extreme query performance at high concurrency. SingleStore achieves this because it is built on a distributed SQL engine that leverages partitioning and multi-version concurrency control for massive parallelism. Analytical query performance comes from storing data in a highly compressed columnar format, allowing vCPUs to scan millions of rows per second. SingleStore also has strong join performance, removing the need to denormalize your data in complex and costly transformations. SingleStore provides ample indexing options and interactive visual profiling tools to ensure maximum performance gains where required — check out this blog for a detailed guide to query tuning.

A key boost in query performance comes from SingleStore’s system of code generation. In contrast to the traditional interpreter-based execution model for databases, SingleStore embeds an industrial compiler to produce highly efficient machine code that enables low-level optimizations. In addition to these built-in engine features, SingleStore offers compute-compute separation, which allows users to isolate compute resources to handle different workloads on the same data, giving them greater control and flexibility over workload performance and costs.
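
When a specific query does need tuning, profiling shows where the time goes. A minimal sketch, assuming hypothetical orders and customers tables:

```sql
-- PROFILE executes the query and collects per-operator statistics;
-- SHOW PROFILE then displays where the time went.
PROFILE
SELECT c.region,
       COUNT(*)      AS orders,
       SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_ts >= NOW() - INTERVAL 1 HOUR
GROUP BY c.region
ORDER BY revenue DESC;

SHOW PROFILE;
```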

4. Scalability

Real-time data warehouses must be highly scalable to handle growing enterprise requirements, without sacrificing ingest or read performance. This is typically handled in a scale-out clustered architecture that can add or remove resources as workload demands grow or change to maintain consistently high performance. Sub-requirements should include options to control scale factors like partitioning strategies on the cluster, isolating compute resources for different use cases and autoscaling for burst-prone workloads.

Why SingleStore. SingleStore’s scalability is rooted in a dynamic scale-out architecture. In this architecture, aggregators act as query coordinators, distributing queries and writes to leaf nodes where the data is stored and processed in parallel. This distributed approach ensures that as data volumes and query loads grow, the system can scale horizontally by simply adding more leaf nodes, maintaining high performance without significant reconfiguration or downtime. For short-term bursts in activity, SingleStore offers autoscaling to provide additional vCPU and RAM on demand.

Workload management in SingleStore further bolsters scalability by allowing administrators to prioritize and control resource allocation based on workload importance. This means that critical queries can be given priority access to resources, ensuring they run quickly and efficiently — even during periods of high demand. Additionally, resource pools within SingleStore enable dynamic allocation of compute and memory resources, optimizing their use across different workloads. This flexibility ensures that resources are efficiently utilized, minimizing waste and ensuring that the system can handle peak loads effectively.
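
As a rough sketch of what this looks like in SQL (the pool name and settings below are illustrative, and available options vary by version and deployment):

```sql
-- Hypothetical pool for ad-hoc dashboard queries: option names follow
-- SingleStore's resource governance features.
CREATE RESOURCE POOL dashboard_pool
WITH MEMORY_PERCENTAGE = 40,         -- cap memory available to the pool
     SOFT_CPU_LIMIT_PERCENTAGE = 50, -- cap CPU when the cluster is busy
     QUERY_TIMEOUT = 30;             -- end pool queries after 30 seconds

-- Route this session's queries through the pool.
SET resource_pool = dashboard_pool;
```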

5. Integration

A real-time data warehouse needs to integrate with a variety of data sources, other databases, data warehouses and data lakes. This starts with the ability to deploy in the cloud and region of your choice, in a managed or self-managed solution, or on-prem. Many RTDW options are deployable in select environments only, typically as a self-managed service in AWS — but what if the real-time requirement is in Azure?

A real-time data warehouse needs native CDC integration options to capture data as it is committed from a transactional source in near-real time. It also requires the ability to quickly move data between the RTDW and a traditional warehouse with as little latency or transformation as possible. Finally, it must have built-in connectors and APIs to Python, dashboards and ecosystem tools to simplify integration and reduce the time and cost associated with deployment.

Why SingleStore. SingleStore provides deployment flexibility and breadth of connectivity options not found in other real-time data warehouse options. Unlike many RTDW solutions limited to specific environments, SingleStore offers both self-managed and cloud deployments available across AWS, Azure and Google Cloud, as well as on-premises. This flexibility ensures that SingleStore can be deployed close to your application infrastructure, minimizing latency and enhancing performance.

SingleStore also supports Apache Iceberg tables, enabling efficient management of large-scale analytics datasets and seamless integration with data lakes. Its native Change Data Capture (CDC) capabilities allow real-time ingestion of data from transactional databases, ensuring your analytics are always up to date. Additionally, SingleStore provides a wide range of built-in connectors and APIs, making it easy to integrate with Python, BI tools like Tableau and other data ecosystem tools, simplifying deployment and accelerating time-to-value.

Customer stories

To truly understand the impact of a robust real-time data warehouse, it’s essential to look at real-world examples where businesses have transformed their operations and achieved significant results by leveraging SingleStore. Here are two customer success stories that highlight the business impact a real-time data warehouse can have.

Leading media company in North America

Problem. A prominent media company in North America faced a significant challenge: their ad sales inventory application, powered by Redshift, was causing operational inefficiencies that impacted their bottom line. Ingesting new data from S3 into Redshift took over two hours, leading to ad spot double bookings and poor customer experiences. On top of that, their dashboards suffered from query latencies of up to five minutes — and struggled with concurrency when more than 100 users accessed them simultaneously.

Solution. The company turned to SingleStore to augment their existing Redshift setup, and the results were transformative: Data ingestion times were slashed by 99%, with new data available in under two seconds. Query latencies saw a 300x improvement, dropping from five minutes to under one second, even with thousands of concurrent users. This not only eliminated the double-booking issue but also dramatically improved the user experience, making their ad sales inventory application a model of efficiency and performance.

Leading cybersecurity solution provider

Problem. In the fast-paced world of cybersecurity, speed and reliability are paramount. A tier-1 cybersecurity and threat detection provider was struggling with lagging threat detection, with delays of up to five minutes compromising their market position. Their existing solution, powered by Snowflake, couldn’t handle the demands of real-time data streaming and a rapidly increasing volume of concurrent queries — reaching upwards of 1,000 per second.

Solution. By integrating SingleStore into their data architecture, the company experienced a dramatic turnaround. Data ingestion latency was reduced by 15x, and the time to actionable insights improved by an astonishing 180x. Queries that once took several seconds to process were now being handled in under 500 milliseconds, even with thousands of concurrent users. This improvement allowed the company to maintain its competitive edge by ensuring near-instantaneous threat detection and response.

SingleStore: Your real-time data warehouse

Real-time analytics empower businesses to act instantly on insights, driving growth and competitive advantage. Achieving this requires a data warehouse capable of real-time ingestion, low-latency processing, performance optimized for both low latency and high concurrency, scalability and seamless integration. SingleStore is engineered from the ground up to handle these demands, delivering real-time analytics at scale.

Interested in learning more about how to build real-time data warehouses? Check out this overview of how we built a real-time digital marketing application, or give it a try here.

