Data Modeling & Design

Normalization vs Denormalization

Normalization (3NF) eliminates data redundancy by splitting data into related tables. Each fact is stored once, reducing inconsistencies. Denormalization intentionally duplicates data to avoid joins and speed up reads. OLTP systems are typically normalized (minimize write anomalies). OLAP systems are typically denormalized (minimize join overhead at query time). The choice depends on whether your workload is write-heavy (normalize) or read-heavy with known query patterns (denormalize).
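The trade-off can be sketched with a toy schema (a minimal illustration using Python's built-in sqlite3; the table and column names here are hypothetical, not from any real system): normalized reads pay a join, denormalized writes pay an extra propagation update.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized (3NF): each fact stored once; reading an order's
# user name requires a join.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "user_id INTEGER REFERENCES users(id), total REAL)")
cur.execute("INSERT INTO users VALUES (1, 'Ada')")
cur.execute("INSERT INTO orders VALUES (10, 1, 99.5)")

row = cur.execute(
    "SELECT u.name, o.total FROM orders o JOIN users u ON u.id = o.user_id"
).fetchone()
print(row)  # ('Ada', 99.5)

# Denormalized: user_name duplicated into the order row, so the
# read is join-free -- but a rename must now touch every copy.
cur.execute("CREATE TABLE orders_wide (id INTEGER PRIMARY KEY, "
            "user_id INTEGER, user_name TEXT, total REAL)")
cur.execute("INSERT INTO orders_wide VALUES (10, 1, 'Ada', 99.5)")
cur.execute("UPDATE users SET name = 'Ada L.' WHERE id = 1")
cur.execute("UPDATE orders_wide SET user_name = 'Ada L.' "
            "WHERE user_id = 1")  # the extra write denormalization costs
```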

Memory anchor

Normalization is storing your friend's phone number once in your contacts (single source of truth). Denormalization is writing it on sticky notes in every room (fast to find, nightmare to update when they change numbers). OLTP normalizes to avoid sticky-note chaos; OLAP denormalizes because it only reads, never updates.

Expected depth

Normalization forms: 1NF (atomic values, no repeating groups), 2NF (no partial dependencies on composite keys), 3NF (no transitive dependencies — every non-key column depends only on the primary key). Most production systems aim for 3NF. Going beyond 3NF (BCNF, 4NF, 5NF) is rarely worthwhile in practice. Denormalization strategies: embed related data (store the user name in the orders table), pre-compute aggregates (store order_count on the user record), materialize views (create a wide table from joins). The cost of denormalization: data redundancy means updates must propagate to every copy (update a username and you must update it in orders too). Denormalization works well when reads vastly outnumber writes, query patterns are known and stable, and eventual consistency of the denormalized copies is acceptable.
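The pre-computed-aggregate strategy above can be kept consistent at write time. A minimal sketch, assuming a hypothetical users/orders schema in SQLite, where a trigger maintains order_count so reads never need a COUNT over the orders table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, "
            "order_count INTEGER DEFAULT 0)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")

# Keep the denormalized aggregate in sync on every order insert.
con.execute("""
CREATE TRIGGER bump_order_count AFTER INSERT ON orders
BEGIN
  UPDATE users SET order_count = order_count + 1 WHERE id = NEW.user_id;
END
""")

con.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
con.executemany("INSERT INTO orders (user_id) VALUES (?)",
                [(1,), (1,), (1,)])

count = con.execute(
    "SELECT order_count FROM users WHERE id = 1"
).fetchone()[0]
print(count)  # 3 -- read is a single-row lookup, no aggregation
```

The design choice: the write path pays a small extra update so that a hot read path (show a user's order count) becomes a primary-key lookup.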

Deep — senior internals

In distributed databases, denormalization is often mandatory. DynamoDB single-table design is extreme denormalization — all entity types in one table with composite keys. MongoDB embedding is denormalization — store comments inside the blog post document. Cassandra requires denormalization because it does not support joins at all — you model tables around query patterns. The modern approach is often 'normalize at write, denormalize at read': store normalized data in PostgreSQL, and create denormalized views in ClickHouse, Elasticsearch, or Redis for specific read patterns. Change Data Capture (CDC) via Debezium or native logical replication keeps the denormalized copies in sync. This gives you the best of both worlds: consistency in the source of truth and performance in the read layer.
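The 'normalize at write, denormalize at read' pattern can be sketched in-process (a hedged toy model: the SourceOfTruth, OrderSearchView, and event tuples below are invented for illustration; in production, Debezium or logical replication would play the event-bus role between PostgreSQL and the read store):

```python
from dataclasses import dataclass, field

@dataclass
class SourceOfTruth:
    """Normalized store: each fact once, plus CDC-style change events."""
    users: dict = field(default_factory=dict)       # id -> name
    orders: dict = field(default_factory=dict)      # id -> (user_id, total)
    subscribers: list = field(default_factory=list)

    def emit(self, event):
        for handler in self.subscribers:
            handler(event)

    def upsert_user(self, uid, name):
        self.users[uid] = name
        self.emit(("user_upsert", uid, name))

    def add_order(self, oid, uid, total):
        self.orders[oid] = (uid, total)
        self.emit(("order_add", oid, uid, total))

class OrderSearchView:
    """Denormalized read model: each order row carries the user's name."""
    def __init__(self, source):
        self.rows = {}    # oid -> {"user_id", "user_name", "total"}
        self.names = {}   # local copy of user names
        source.subscribers.append(self.apply)

    def apply(self, event):
        if event[0] == "user_upsert":
            _, uid, name = event
            self.names[uid] = name
            for row in self.rows.values():   # propagate rename to copies
                if row["user_id"] == uid:
                    row["user_name"] = name
        elif event[0] == "order_add":
            _, oid, uid, total = event
            self.rows[oid] = {"user_id": uid,
                              "user_name": self.names.get(uid),
                              "total": total}

db = SourceOfTruth()
view = OrderSearchView(db)
db.upsert_user(1, "Ada")
db.add_order(10, 1, 99.5)
db.upsert_user(1, "Ada L.")   # rename: the view catches up via the event
print(view.rows[10]["user_name"])  # Ada L.
```

The key property: the normalized store handles the rename once, and the denormalized view converges by consuming the change event rather than by the writer chasing copies.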

🎤 Interview-ready answer

My approach is to normalize the source of truth (PostgreSQL in 3NF) and denormalize for read performance where needed. The primary database stores each fact once, ensuring consistency. When specific read patterns need better performance — search, analytics, caching — I create denormalized representations in the appropriate engine (Elasticsearch for search, ClickHouse for analytics, Redis for caching) and keep them in sync via CDC or event streams. This avoids the classic denormalization trap: updating a user name and having to chase that update across 20 tables. The source of truth handles the update once, and downstream consumers get the change via events.

Common trap

Denormalizing your primary database for read performance and then struggling with update anomalies when data changes need to propagate to multiple locations. Keep the primary database normalized and denormalize in secondary read stores that can be rebuilt from the source of truth.
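The 'rebuildable from the source of truth' property is worth making concrete (again a minimal SQLite sketch with hypothetical table names): because the wide table is derived, it can be dropped and regenerated from the normalized tables whenever it drifts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Normalized source of truth.
con.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
INSERT INTO users VALUES (1, 'Ada');
INSERT INTO orders VALUES (10, 1, 99.5);
""")

# The denormalized wide table is disposable: drop and rebuild it
# from the normalized tables instead of patching copies in place.
con.executescript("""
DROP TABLE IF EXISTS orders_wide;
CREATE TABLE orders_wide AS
  SELECT o.id, u.name AS user_name, o.total
  FROM orders o JOIN users u ON u.id = o.user_id;
""")

wide = con.execute("SELECT user_name, total FROM orders_wide").fetchone()
print(wide)  # ('Ada', 99.5)
```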

Related concepts