Distributed SQL databases have significant advantages over traditional approaches such as sharding middleware and NoSQL. Read on to learn everything you need to know to get started with Distributed SQL Databases.
The Need for Distributed SQL
Developers are building new applications in the cloud at insanely fast rates, iterating through development cycles and scaling on-demand as required. Monolithic applications are turning into agile microservices, and while the focus is primarily on the application layer, there remains a heavy reliance on the data layer to maintain a reliable customer experience.
Distributed SQL databases provide consistency of developer experience and customer experience. Developers don’t need to compromise their ACID-compliant transactions, complex joins, or other table stakes database features. Likewise, data consumers get the benefit of improved performance and scalability as their own data grows. Traditional approaches, such as sharding middleware across multiple independent servers (MySQL, PostgreSQL), are limited by poor price:performance ratios.
Traditional Relational Databases vs. NoSQL Databases vs. Distributed SQL Databases
For many years, traditional databases like PostgreSQL and MySQL gave users the flexibility of ANSI SQL and ease of application connectivity, but as data volumes grew, this approach became difficult (and expensive!) to sustain.
Over time, organizations took the creative approach of sharding their single-server databases across multiple independent servers. The advantage of using sharding middleware is that SQL support is maintained. However, this band-aid approach limits users in a few ways:
The creativity continued with the advent of NoSQL databases, this time creating an additional level of database sprawl. Organizations started siloing their workloads by data type, using technologies like MongoDB for document data, Redis for key-value, TimeScaleDB for time-series, etc. Similar to the sharded middleware approach, this can lead to large-scale data movement and thereby slower performance and complexity. A few other drawbacks include:
The sprawl of databases mentioned above has exacerbated the need for NewSQL databases. NewSQL is a term first defined by Matt Aslett when he was at 451 Research to describe a new, emerging class of database systems in 2011 which provide the speed and scale of NoSQL systems but retain characteristics and ACID guarantees.
Key Characteristics of a Distributed SQL Database
The primary characteristic of a distributed SQL database is the ability to shard automatically based on a declared key, thereby reducing the overhead of sharding middleware previously discussed. Contrary to shared-everything database architectures like MySQL and PostgreSQL, the Distributed SQL architecture is made up of three layers:
Having a SQL API gives users a single endpoint to access data across an entire database cluster. This makes it incredibly simple to perform CRUD operations on a database. Some distributed databases like SingleStore take the ease of use to the next level by adhering to specific standards, such as the MySQL wire protocol. This assists users in migration and overall simplicity as it relates to application compatibility.
Distributed Query Execution
Commonly, distributed database architectures are made of two layers: aggregator nodes which handle the query load from applications, and leaf nodes which house the data. Aggregators query the leaves and send results back to the client.
Both DDL and DML operations benefit heavily from Distributed SQL architectures. When tables are defined (DDL), the schema designer identifies a shard key (or the table will be randomly sharded). This determines where every newly inserted row is physically located. Rows with the same shard key value are guaranteed to be on the same partition within the database. Below are some questions to ask yourself before defining a shard key:
As you may be able to tell by the DML guidelines above, the success of your queries will be heavily dependent upon defining a proper schema:
Finally, it is important to discuss distributed joins. Aligning the shard keys of large tables is critical as it enables collocated joins. Joins across two tables that have identical shard keys, performed on those same columns, will be performed on the same partition. Identifying join patterns in advance can enable highly effective, performant joins.
In scalable distributed systems, data is sharded and stored across nodes within a system, primarily optimizing filter and aggregation queries. Storage is also scaled simply by adding servers, an operation which nowadays can most often be performed with the click of a button.
It is important to note that even within the distributed SQL database category, there are various architectural distinctions. The shared storage model completely separates storage from compute, paying the penalty of not having data local to compute resources. Shared nothing architectures enable the best price:performance ratios by limiting data movement. Shared nothing architectures in the cloud like SingleStore’s are also starting to leverage blobstore (i.e., S3, GCS) for additional storage, enabling users to scale without diminishing TCO. Other approaches, such as that of Clickhouse, may operate in a shared nothing state but use a separate, `DISTRIBUTED` table type to aggregate data without actually storing anything (similar to a view).
In summary, distributed SQL databases have significant advantages over traditional approaches such as sharding middleware and NoSQL.
SingleStore makes the concepts of distributed SQL seamless to execute with easy-to-use, declarative shard keys and sort keys, straightforward MySQL wire compatibility, and a managed cloud offering. In short, SingleStore is the easiest place to begin your foray into distributed databases. Get started at singlestore.com/free today and get \$500 in free credits!