Sharding
October 30, 2023
Overview
The concept of vertical vs horizontal scaling doesn't only exist with application servers. It exists with database servers as well. Horizontally scaling a database comes in two flavors:
- Partitioning
- Sharding
Database Table Partitioning
Partitioning is much simpler with fewer risks than sharding. All it takes is adding a PARTITION BY
clause when creating the
table (if the context is SQL). The concept is so simple that it's relatively easy to implement imperatively as well. The
effect of this is separating a single table into multiple partitions based on ranges of rows according to a value or values in one
or more columns. The result is fewer conflicts due to the fact that each partition is stored in different files on disk. When data is
grouped less closely together, seek times are reduced.
The most pronounced benefit of partitioning is in queries that scan many rows since these scans can be distributed in parallel. On the other hand, queries that utilize indexes or keys see marginal benefits.
If not configured correctly or for the correct use case, partitioning can even decrease performance. For example:
- Partitioning does increase complexity, thereby increasing overhead (such as query planning time), which is why it shouldn't be used for small tables.
- Partitioning on the wrong column (based on query patterns) is an obvious mistake that nullifies the benefit.
- Partitioning can impact how indexes are managed and accessed. Indexes need to be aligned with partitions.
- Unevenly distributed data per partition can cause uneven load and performance issues.
Database Sharding
Whereas partitioning splits a single database table into multiple partitions, sharding splits an entire database based
on a value or values in one or more columns of a table (usually just one). All tables related to this primary table (that contains
the column responsible for splitting it into ranges) are duplicated. The table that is usually used for such a split is
the customers
table and the column is CustomerId
. The database is split into separate nodes (or instances), which allows
for horizontal scaling of compute (rather than horizontal scaling of a single table on the same compute node). In front of the
database nodes sits a router to route requests to the appropriate database based on the column value(s) that the schema is sharded on.
To dive deeper and learn about the different ways to shard a database: What is Database Sharding?
Sharding has the benefit of reducing the blast radius of any failure to a subset of the entire dataset. There is a downside in this trade-off in that there are more potential points of failure given the additional nodes. However, this downside can be offset with the use of redundant replicas on each node.
Full-Stack Sharding
Full-Stack sharding is when each shard contains a copy of the entire system. Each segment (or shard) handles the segment of
traffic that is routed to it via the router. CustomerId
would, again, be a common column to route on.
CAP Theorem
The CAP Theorem states that a database system can only provide two out of the three: Consistency, Availability, and Partition Tolerance. As time has passed, this rule still stands, but the window of time where any two are provided has shrunk drastically, making it appear as though all three exist at all times. An example of this is a sharded database where each shard has a redundant replica. During normal operation, it provides Consistency and Availability. But, during a network failure, to maintain Partition Tolerance, the database system must decide on a trade-off between Consistency and Availability. The latter is typically preferred. Modern hardware (like SSDs) shrink this window.