Getting Started with Distributed SQL Databases

Distributed SQL databases have significant advantages over traditional approaches such as sharding middleware and NoSQL. Read on to learn everything you need to know to get started with Distributed SQL Databases.

The Need for Distributed SQL

Developers are building new applications in the cloud at insanely fast rates, iterating through development cycles and scaling on-demand as required. Monolithic applications are turning into agile microservices, and while the focus is primarily on the application layer, there remains a heavy reliance on the data layer to maintain a reliable customer experience.

Distributed SQL databases provide consistency of developer experience and customer experience. Developers don’t need to compromise their ACID-compliant transactions, complex joins, or other table stakes database features. Likewise, data consumers get the benefit of improved performance and scalability as their own data grows. Traditional approaches, such as sharding middleware across multiple independent servers (MySQL, PostgreSQL), are limited by poor price:performance ratios.

Traditional Relational Databases vs. NoSQL Databases vs. Distributed SQL Databases

For many years, traditional databases like PostgreSQL and MySQL gave users the flexibility of ANSI SQL and ease of application connectivity, but as data volumes grew, this approach became difficult (and expensive!) to sustain.

Over time, organizations took the creative approach of sharding their single-server databases across multiple independent servers. The advantage of using sharding middleware is that SQL support is maintained. However, this band-aid approach limits users in a few ways:

Sharding: With sharding middleware, the data is distributed across multiple nodes (helping scale reads and writes), but fails to perform joins on the data across those shards without adding cross-shard joins into your application code
Slower Performance: Middleware adds an extra layer, increasing performance hops especially for writes.
Operational Complexity: Sharding middleware introduces more moving parts (and cost). For example, users must add read/write split rules for each shard and maintain those rules.

The creativity continued with the advent of NoSQL databases, this time creating an additional level of database sprawl. Organizations started siloing their workloads by data type, using technologies like MongoDB for document data, Redis for key-value, TimeScaleDB for time-series, etc. Similar to the sharded middleware approach, this can lead to large-scale data movement and thereby slower performance and complexity. A few other drawbacks include:

ACID: Not all NoSQL databases retain ACID-compliance, which is critical for integrity and consistency of data.
Standardization: Each NoSQL flavor has its own approach and there is no consistent language or data structure.
Tools: Being outside of the relational ecosystem makes it difficult to leverage common tools and consoles to access and manipulate data.

The sprawl of databases mentioned above has exacerbated the need for NewSQL databases. NewSQL is a term first defined by Matt Aslett when he was at 451 Research to describe a new, emerging class of database systems in 2011 which provide the speed and scale of NoSQL systems but retain characteristics and ACID guarantees.

Key Characteristics of a Distributed SQL Database

The primary characteristic of a distributed SQL database is the ability to shard automatically based on a declared key, thereby reducing the overhead of sharding middleware previously discussed. Contrary to shared-everything database architectures like MySQL and PostgreSQL, the Distributed SQL architecture is made up of three layers:

A SQL API.
Distributed Query Execution.
Distributed Storage.

SQL API

Having a SQL API gives users a single endpoint to access data across an entire database cluster. This makes it incredibly simple to perform CRUD operations on a database. Some distributed databases like SingleStore take the ease of use to the next level by adhering to specific standards, such as the MySQL wire protocol. This assists users in migration and overall simplicity as it relates to application compatibility.

Distributed Query Execution

Commonly, distributed database architectures are made of two layers: aggregator nodes which handle the query load from applications, and leaf nodes which house the data. Aggregators query the leaves and send results back to the client.

Both DDL and DML operations benefit heavily from Distributed SQL architectures. When tables are defined (DDL), the schema designer identifies a shard key (or the table will be randomly sharded). This determines where every newly inserted row is physically located. Rows with the same shard key value are guaranteed to be on the same partition within the database. Below are some questions to ask yourself before defining a shard key:

Does this table have a primary key? Make sure there is a shard key and that it is a subset of the primary key.
Do you frequently join on a specific set of columns (e.g. where users.id = action.user_id and users.country = action.country)? Try to make the shard key a subset of the joined columns. In this example, you would want to shard on id and country.
Do you frequently filter on a specific set of columns (e.g. where user_id = 17 and date = '2007-06-14')? Try to make the shard key a subset of the filtered columns, such as user_id and date in this example.
Do you have high concurrency queries? Choose a shard key that allows these queries to be single partition.
Do you need to UPDATE or ALTER any fields in the shard key? Remove those fields from the shard key.
Is this a small, infrequently changing table that needs to be on every node in the cluster to guarantee local joins? Use a reference table instead of sharding the table.

As you may be able to tell by the DML guidelines above, the success of your queries will be heavily dependent upon defining a proper schema:

Index Matching: For DML where filter conditions exactly match the shard key, aggregators will only query one partition. Contrarily, queries that do not match any index (including non-shard key indexes) will do a full table scan on all partitions and likely be less efficient.
Aggregator Merging: Complex queries typically push down processing to the leaves and leverage the aggregators to merge results.
- ORDER BYs leverage leaves to filter and sort data (see: sort keys) and then simply merge the results
- Aggregates leverage a concept called local-global aggregation, where each leaf computes non-associative aggregates and use the aggregators to merge and compute a final result.
- GROUP BYs efficiently leverage the power of a distributed cluster, sending the GROUP BY down so the leaves process data all the way to the final result set. The aggregator then merges these results similar to ORDER BY queries.

Finally, it is important to discuss distributed joins. Aligning the shard keys of large tables is critical as it enables collocated joins. Joins across two tables that have identical shard keys, performed on those same columns, will be performed on the same partition. Identifying join patterns in advance can enable highly effective, performant joins.

Distributed Storage

In scalable distributed systems, data is sharded and stored across nodes within a system, primarily optimizing filter and aggregation queries. Storage is also scaled simply by adding servers, an operation which nowadays can most often be performed with the click of a button.

It is important to note that even within the distributed SQL database category, there are various architectural distinctions. The shared storage model completely separates storage from compute, paying the penalty of not having data local to compute resources. Shared nothing architectures enable the best price:performance ratios by limiting data movement. Shared nothing architectures in the cloud like SingleStore’s are also starting to leverage blobstore (i.e., S3, GCS) for additional storage, enabling users to scale without diminishing TCO. Other approaches, such as that of Clickhouse, may operate in a shared nothing state but use a separate, DISTRIBUTED table type to aggregate data without actually storing anything (similar to a view).

Conclusion

In summary, distributed SQL databases have significant advantages over traditional approaches such as sharding middleware and NoSQL.

A distributed SQL Database is both horizontally scalable and elastic, especially with managed service offerings that enable one-click scaling.
Sharding offers users the ability to distribute data as they wish and according to the workload, driving impressive price:performance ratios.
Cloud native distributed SQL databases offer high availability and data resilience without the pressure of complexity. This is done by enabling replication by default and having unlimited storage to ensure data is always persisted to object storage.

SingleStore makes the concepts of distributed SQL seamless to execute with easy-to-use, declarative shard keys and sort keys, straightforward MySQL wire compatibility, and a managed cloud offering. In short, SingleStore is the easiest place to begin your foray into distributed databases. Get started at singlestore.com/cloud-trial today and get $600 in free credits!