Database Partitioning vs. Sharding: What’s the Difference?

In this post, SingleStore Developer Advocate, Joe Karlsson, explains the differences between database sharding and database partitioning. Both are methods of breaking a large dataset into smaller subsets - but there are differences. This article explores when to use each - or even to combine them for data intensive applications.

Introduction

If you’ve spent time researching scalable database architecture techniques, chances are that you’ve run across the terms “sharding” and “partitioning.” So, what is the difference between these two concepts? Now, these two terms and concepts might seem similar at first glance. This is because sharding and partitioning are both related to breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple database instances, while partitioning does not. Let’s explore each concept in detail.

What is partitioning?

Partitioning is the database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller, individual tables, queries that access data can run faster because there is less data to scan. The main goal of partitioning is to aid in the maintenance of large tables and to reduce the overall response time to read and load data for particular SQL operations.

Horizontal partitioning

Horizontal partitioning is a technique used to divide a large dataset into smaller, more manageable pieces called partitions. Each partition contains a subset of the data, and these partitions are typically spread across multiple servers or storage devices. This method, also known as sharding, is popular for enhancing the scalability and performance of large databases.

In horizontal partitioning, each partition is defined by a partition key, which is a column or set of columns that determines the partition to which a row of data belongs. For instance, a database containing customer data might be partitioned by region, with each region’s data stored in a separate partition. This approach allows for efficient data distribution and management.

The advantages of horizontal partitioning are numerous. By dividing the data into smaller partitions, each partition can be stored on a separate server or storage device, improving performance and reducing the risk of data loss. Additionally, horizontal partitioning simplifies the management and maintenance of large datasets, as each partition can be backed up and restored independently.

However, horizontal partitioning also presents challenges. Defining an appropriate partition key is crucial, as uneven data distribution can lead to some partitions becoming overloaded, negatively impacting performance. To mitigate this, it’s essential to carefully design the partitioning scheme and monitor data distribution to ensure balanced partitions.

Vertical partitioning

Vertical table partitioning is mostly used to increase SQL Server performance, especially in cases where a query retrieves all columns from a table that contains a number of very wide text or BLOB columns. In this case to reduce access times the BLOB columns can be split into its own table. Another example is to restrict access to sensitive data, e.g., passwords, salary information, etc. Vertical partitioning splits a table into two or more tables containing different columns:

On SingleStore, by default, When you run CREATE DATABASE, SingleStore splits the database into partitions, which are distributed evenly among available nodes. This allows SingleStore to be highly available by default. With CREATE DATABASE, you can specify the number of partitions with the PARTITIONS=X option.

The master partitions are distributed evenly on nodes across the cluster. The master partitions on every node in an availability group have their replicas spread evenly among a set of nodes in the opposite availability group.

When to partition a table?

Here are some suggestions for when to partition a table:

Tables greater than 2 GB should always be considered as candidates for partitioning into multiple partitions.
Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month’s data is updatable and the other 11 months are read only.
When the contents of a table need to be distributed across different types of storage devices.

What is sharding?

Sharding is actually a type of database partitioning, more specifically, Horizontal Partitioning. Sharding, is replicating [copying] the schema, and then dividing the data based on a shard key onto a separate database server instance, to spread load.

Every distributed table has exactly one shard key. A shard key can contain any number of columns. On SingleStore, when you run CREATE TABLE to create a table, you can specify a shard key for the table.

A table's shard key determines in which partition a given row in the table is stored. When you run an INSERT query, the node computes a hash function of the values in the column or columns that make up the shard key, which produces the partition number where the row should be stored. The node then directs the INSERT operation to the appropriate node machine and partition.

For example, the table below has the shard key that contains only the first column. All people with the same first name will be stored on the same partition.

Diagram showing how data is distributed amongst shards when the dataset is split up by a shard key.

Note: See the Optimizing Table Data Structures guide for information on how to choose a shard key.

When to shard a table?

Sharding your data can lead to many large performance improvements in your database. The following are some examples of how sharding can help improve performance:

Reduced index size - Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces
Distribute database over multiple machines - A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, greatly improving performance
Segment data by geography - In addition, if the database shard is based on some real-world segmentation of the data (e.g., European customers v. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

When NOT to shard a table?

Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:

SQL complexity - Increased bugs because the developers have to write more complicated SQL to handle sharding logic
Additional software - that partitions, balances, coordinates, and ensures integrity can fail
Single point of failure - Corruption of one shard due to network/hardware/systems problems causes failure of the entire table.
Fail-over server complexity - Fail-over servers must have copies of the fleets of database shards.
Backups complexity - Database backups of the individual shards must be coordinated with the backups of the other shards.
Operational complexity - Adding/removing indexes, adding/deleting columns, modifying the schema becomes much more difficult.

Luckily, SingleStore manages most of these added complexities from sharding your data for you, so you don't have to worry!

Designing partitions for scalability and performance

When designing partitions for scalability and performance, several factors must be considered. First, determining the optimal partition size is crucial. This depends on the dataset’s size, available storage capacity, and performance requirements. A good rule of thumb is to aim for partitions that are large enough to contain a significant amount of data but small enough to be managed efficiently.

The partition key is another critical factor. It determines how the data is divided among the partitions and should be chosen carefully to ensure even data distribution. A good partition key is based on a column or set of columns with high cardinality, meaning they contain a large number of unique values.

In addition to the partition key, the data distribution strategy is essential. Several strategies can be employed, including range partitioning, hash partitioning, and composite partitioning. Range partitioning involves dividing the data into partitions based on a range of values, while hash partitioning uses a hash function to determine partition placement. Composite partitioning combines multiple partitioning methods to create a more complex and tailored partitioning scheme.

Finally, regular monitoring of the partitions is vital to ensure optimal performance. This includes tracking data distribution, partition size, and performance metrics such as query performance and storage utilization. By keeping a close eye on these factors, you can ensure that your partitions are functioning efficiently and effectively.

Sharding vs. partitioning: what's the difference?

Partitioning is a generic term that just means dividing your logical entities into different physical entities for performance, availability, or some other purpose. "Horizontal partitioning", or sharding, is replicating the schema, and then dividing the data based on a shard key.

On a final note, you can combine both partitioning and sharding techniques on your database. In fact, sometimes using both strategies is required for data-intensive applications.

So, now that we've discussed the difference between sharding and partitioning, what's next? If you want to play around with sharding and partitioning techniques in the cloud, the best way is to spin up a database cluster on SingleStore and try it out for yourself! You can sign up for FREE here.

Personally, I would also recommend that you check out the SingleStore Resources. There's tons of great developer projects and demos for many languages, frameworks and integrations.

The SingleStore Training page includes more self paced courses like Schema Design, Data Ingestion, Optimizing Queries, and more.

If you run into any issues or get stuck, make sure to connect with the SingleStore community. The community forums are the best place to get all your SingleStore questions answered.

Personally, I would also recommend that you check out the SingleStore Resources. There’s tons of great developer projects and demos for many languages, frameworks and integrations.

The SingleStore Training page includes more self paced courses like Schema Design, Data Ingestion, Optimizing Queries, and more.

If you run into any issues or get stuck, make sure to connect with the SingleStore community. The community forums are the best place to get all your SingleStore questions answered.

Additional Resources:

[SingleStore Documentation] High Availability and Load-Balanced Partition Placement: https://docs.singlestore.com/managed-service/en/getting-started-with-managed-service/about-managed-service/high-availability.html
[Docs] Sharding: https://docs.singlestore.com/managed-service/en/getting-started-with-managed-service/about-managed-service/sharding.html
[Wikipedia] Shard (database architecture): https://en.wikipedia.org/wiki/Partition_(database)
[Wikipedia] Partition (database): https://en.wikipedia.org/wiki/Shard_(database_architecture)

Database Partitioning vs. Sharding: What’s the Difference?

Introduction