Sharding Strategies
Sharding splits a dataset across multiple database nodes so each node holds a subset of the data, distributing write and storage load horizontally.
SHARDing = SHARing the Dish. You slice a pizza into pieces so multiple people can eat at once. Consistent hashing = a lazy Susan: spin the table and each guest grabs the slice nearest to them.
Three main strategies: (1) Range sharding — shard by key range (user IDs 1–1M on shard 1). Simple to implement and supports range queries, but creates hot spots if data isn't uniformly distributed. (2) Hash sharding — apply a hash function to the key and mod by shard count. Distributes load evenly, but range queries become scatter-gather across all shards, and changing the shard count remaps almost every key. (3) Consistent hashing — place keys and nodes on a ring; a key belongs to the next node clockwise. Adding or removing a node remaps only ~1/N of keys, minimizing resharding cost.
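Strategies (2) and (3) can be sketched in a few lines. This is a minimal illustration, not a production implementation — the node names and key format are made up, and a real ring would use virtual nodes (covered below):

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    # md5 is stable across processes, unlike Python's built-in hash()
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# (2) Hash sharding: even spread, but range queries must hit every shard,
# and changing num_shards remaps almost every key
def hash_shard(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards

# (3) Consistent hashing: hash each node onto a ring; a key belongs to
# the first node clockwise from the key's own hash position
class Ring:
    def __init__(self, nodes):
        self._points = sorted((stable_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._points[idx][1]

before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
# Only keys falling on the new node's arc change owner; everything else stays put
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"{moved}/1000 keys moved after adding a node")
```

Note that with one ring position per node, arc sizes (and therefore load) vary a lot; that is exactly the problem virtual nodes solve.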
Consistent hashing with virtual nodes (vnodes) is the production standard (used by Cassandra, DynamoDB, Riak). Each physical node owns multiple virtual positions on the ring, so load is balanced even with heterogeneous hardware and node churn is absorbed smoothly. Shard key selection is the most critical design decision: a poor shard key (e.g., timestamp for time-series data) causes write hot spots. Compound shard keys (tenant_id + entity_id) can distribute load while keeping tenant data co-located. Cross-shard transactions require 2PC or are avoided by design — most sharded systems achieve isolation by keeping related data on the same shard.
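A vnode ring is a small extension of the basic ring: each physical node is hashed onto the ring many times. A rough sketch, with illustrative node names and an arbitrary vnode count:

```python
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class VNodeRing:
    """Consistent-hash ring where each physical node owns many
    virtual positions, smoothing out load imbalance."""

    def __init__(self, nodes, vnodes_per_node=64):
        points = []
        for node in nodes:
            for i in range(vnodes_per_node):
                # each virtual node hashes to its own ring position
                points.append((h(f"{node}#vnode{i}"), node))
        points.sort()
        self._hashes = [p for p, _ in points]
        self._owners = [n for _, n in points]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, h(key)) % len(self._hashes)
        return self._owners[idx]

ring = VNodeRing(["node-a", "node-b", "node-c"])
load = Counter(ring.node_for(f"user:{i}") for i in range(30_000))
print(load)  # with 64 vnodes per node, shares land near 10k each
```

Heterogeneous hardware falls out naturally: give a beefier node more vnodes and it owns proportionally more of the ring.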
I'd use consistent hashing with virtual nodes for most sharded systems. The shard key choice is critical — I'd pick a high-cardinality key that distributes writes evenly and avoids hot spots. For a social network, I'd shard users by user_id hash; for multi-tenant SaaS, by tenant_id. I'd plan for resharding from day one by using logical shards (e.g., 1024 virtual shards mapped to N physical nodes), so scaling out means remapping virtual shards, not rehashing all data.
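The logical-shard idea above can be sketched as a fixed key-to-shard function plus a mutable shard-to-node table. The shard count and modulo placement here are illustrative assumptions, not a prescribed layout:

```python
import hashlib

NUM_LOGICAL_SHARDS = 1024  # fixed at design time; never changes

def logical_shard(key: str) -> int:
    # key -> logical shard is permanent, so data never needs rehashing
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_LOGICAL_SHARDS

def build_mapping(num_physical_nodes: int) -> dict[int, int]:
    # logical shard -> physical node; scaling out means editing this
    # table and bulk-moving whole logical shards, not rehashing keys
    return {s: s % num_physical_nodes for s in range(NUM_LOGICAL_SHARDS)}

mapping = build_mapping(4)           # start with 4 physical nodes
shard = logical_shard("user:12345")  # stable for the system's lifetime
node = mapping[shard]

# Scaling 4 -> 8 nodes: exactly half the logical shards change owner
# (those where s % 8 >= 4), and each move is one shard's worth of data
new_mapping = build_mapping(8)
moved = sum(mapping[s] != new_mapping[s] for s in range(NUM_LOGICAL_SHARDS))
print(f"{moved} of {NUM_LOGICAL_SHARDS} logical shards reassigned")
```

In practice the mapping table lives in a coordination service or config store, and migrations copy one logical shard at a time while both mappings are live.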
Sharding by a monotonically increasing key (auto-increment ID, timestamp) causes all new writes to hit the same shard — the 'hot partition' problem. Always analyze write patterns before choosing a shard key.
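A quick simulation makes the hot-partition problem concrete. The shard sizes and ID range are arbitrary assumptions for illustration:

```python
import hashlib
from collections import Counter

def range_shard(user_id: int, shard_size: int = 1_000_000) -> int:
    # range sharding: IDs 0..999_999 -> shard 0, 1_000_000..1_999_999 -> shard 1, ...
    return user_id // shard_size

def hash_shard(user_id: int, num_shards: int = 4) -> int:
    return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % num_shards

# A burst of signups with auto-increment IDs: all new, all adjacent
new_ids = range(3_000_001, 3_001_001)

print(Counter(range_shard(i) for i in new_ids))  # all 1000 writes land on shard 3
print(Counter(hash_shard(i) for i in new_ids))   # spread across all 4 shards
```

Under range sharding every new write hits the tail shard; hashing the same IDs spreads the burst evenly.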