Sharding Strategies
Sharding distributes data across multiple database instances (shards), each holding a subset of the data. Unlike partitioning (which splits within a single instance), sharding splits across separate servers for horizontal scaling. Common strategies: hash-based (hash the shard key and modulo by shard count), range-based (shard by date or ID range), and directory-based (a lookup table maps keys to shards). Sharding adds significant complexity — only do it when a single instance truly cannot handle the load.
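The three routing strategies can be sketched in a few lines. This is an illustrative sketch, not any particular database's implementation; the shard count, the 1M-rows-per-shard range size, and the tenant names in the directory are all assumptions chosen for the example.

```python
import hashlib

SHARD_COUNT = 4

def hash_shard(key: str) -> int:
    """Hash-based: stable hash of the shard key, modulo shard count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

def range_shard(user_id: int) -> int:
    """Range-based: fixed ID ranges per shard (here, 1M IDs each)."""
    return min(user_id // 1_000_000, SHARD_COUNT - 1)

# Directory-based: an explicit lookup table maps keys to shards.
DIRECTORY = {"tenant_a": 0, "tenant_b": 2}

def directory_shard(tenant: str) -> int:
    """Directory-based, falling back to hashing for unmapped tenants."""
    return DIRECTORY.get(tenant, hash_shard(tenant))
```

Note the trade-offs visible even here: hash-based spreads load evenly but scatters ranges, range-based keeps ranges contiguous but can create hot shards for recent data, and directory-based is the most flexible but makes the lookup table itself a critical dependency.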
Sharding is like splitting a restaurant into separate kitchens: Kitchen A handles customers A-M, Kitchen B handles N-Z. Fast per-kitchen, but if a couple with last names 'Adams' and 'Zhang' want to share a meal (cross-shard join), a waiter has to run between buildings. Don't build two kitchens until one is truly full.
Shard key selection is the most critical decision. A good shard key has high cardinality (many unique values), distributes data and queries evenly, and aligns with your most common access pattern (queries should hit a single shard whenever possible). Cross-shard queries (joins across shards) are expensive and should be avoided by co-locating related data on the same shard. Cross-shard transactions require two-phase commit or saga patterns. Rebalancing (adding or removing shards) is painful with simple hash-modulo sharding because changing the shard count remaps most keys, forcing a bulk data migration. Consistent hashing reduces this movement: adding a node reassigns only the keys that fall near the new node's positions on the ring. MongoDB handles sharding natively with mongos routing and config servers. PostgreSQL sharding options include Citus (extension) or application-level sharding.
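A minimal consistent-hash ring makes the rebalancing claim concrete. This is a sketch for illustration (MD5 positions, 100 virtual nodes per shard are arbitrary choices), not a production implementation: when a shard is added, only the keys whose positions now fall closest to the new shard's virtual nodes move, and they all move to the new shard.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, shard)
        for shard in shards:
            self.add_shard(shard)

    def _pos(self, label: str) -> int:
        return int(hashlib.md5(label.encode()).hexdigest(), 16)

    def add_shard(self, shard: str) -> None:
        # Each shard occupies many points on the ring so load evens out.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        # A key belongs to the first shard position at or after its hash,
        # wrapping around the ring.
        idx = bisect.bisect(self.ring, (self._pos(key), "")) % len(self.ring)
        return self.ring[idx][1]
```

With hash-modulo, growing from 3 to 4 shards remaps roughly three quarters of all keys; with this ring, only about a quarter move, and every moved key lands on the new shard.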
Sharding anti-patterns: sharding too early (single-instance PostgreSQL handles multi-TB with proper optimization), choosing a low-cardinality shard key (creates hot shards), and not planning for shard rebalancing. Vitess (used by YouTube, Slack, PlanetScale) provides MySQL sharding with a proxy layer that handles routing, resharding, and schema migrations. CockroachDB and TiDB provide automatic sharding with distributed SQL, avoiding the operational complexity of manual sharding at the cost of higher per-query latency. Application-level sharding (your code decides which shard to query) is the most flexible but puts the burden on the application — every query must include the shard key, every migration must run on every shard, and cross-shard operations require custom orchestration. Consider application-level sharding only when you have exhausted single-instance optimizations and your team has the expertise to maintain it.
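The "every query must include the shard key" burden of application-level sharding can be enforced at a single choke point. The sketch below is hypothetical (the `ShardRouter` class, tenant_id as the shard key, and the DSN strings are all assumptions, not a real library's API): a router that resolves a connection string from the shard key and rejects any query that omits it.

```python
import hashlib

class ShardRouter:
    """Hypothetical application-level router keyed on tenant_id."""

    def __init__(self, dsns):
        self.dsns = dsns  # one connection string per shard

    def dsn_for(self, tenant_id) -> str:
        # Refuse shard-key-less queries outright; a scatter-gather
        # fallback would quietly turn into a cross-shard query.
        if tenant_id is None:
            raise ValueError("query must include the shard key (tenant_id)")
        h = int(hashlib.sha1(str(tenant_id).encode()).hexdigest(), 16)
        return self.dsns[h % len(self.dsns)]
```

Centralizing routing like this also gives you one place to implement the other application-level burdens the paragraph lists, such as iterating all DSNs when running a schema migration on every shard.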
I treat sharding as a last resort after exhausting single-instance optimizations — vertical scaling, read replicas, better indexing, caching, and query optimization. When sharding is truly needed, the shard key is everything. I choose a key with high cardinality that aligns with the primary access pattern (usually tenant_id for SaaS, or user_id for consumer apps) so most queries hit a single shard. I use consistent hashing to minimize data movement when adding shards. For cross-shard queries (analytics, admin dashboards), I replicate data to an analytics database (ClickHouse, BigQuery) via CDC rather than running cross-shard queries. If starting fresh, I consider NewSQL databases (CockroachDB, TiDB) that provide automatic sharding with SQL compatibility, trading some per-query latency for operational simplicity.
Sharding prematurely. A properly tuned single PostgreSQL instance handles multiple terabytes. Sharding adds enormous complexity: cross-shard joins, distributed transactions, rebalancing, schema migrations across shards, and operational overhead. Exhaust all single-instance optimizations before considering sharding. Most startups that shard early regret it.