Row vs Column Storage
Row-oriented storage (PostgreSQL, MySQL) stores all columns of a row together on disk. This is efficient for OLTP workloads where you frequently read or write entire rows. Column-oriented storage (ClickHouse, Parquet, BigQuery) stores all values of a single column together. This is efficient for analytical queries that scan a few columns across millions of rows, because the database reads only the columns it needs and achieves much better compression.
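The contrast can be sketched in a few lines of pure Python. The table and values here are hypothetical; the point is that an analytical query over one column only touches that column's array in the columnar layout, while the row layout must visit every record.

```python
# Row-oriented: each record kept together, as in a row store's pages.
rows = [
    {"id": 1, "country": "USA", "amount": 120},
    {"id": 2, "country": "USA", "amount": 80},
    {"id": 3, "country": "DE", "amount": 200},
]

# Column-oriented: one contiguous list per column.
columns = {
    "id": [1, 2, 3],
    "country": ["USA", "USA", "DE"],
    "amount": [120, 80, 200],
}

# Analytical query: SUM(amount). The row layout walks every record
# (and on disk would read all columns); the column layout reads one array.
total_row = sum(r["amount"] for r in rows)
total_col = sum(columns["amount"])
assert total_row == total_col == 400
```

On disk the difference matters far more than in memory: the columnar engine can skip reading `id` and `country` entirely, which is where the I/O savings come from.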
Row storage is like a filing cabinet where each drawer holds one person's entire folder. Column storage is like a spreadsheet taped to the wall: you can scan one column with your eyes instantly, but grabbing one person's full row means walking across the whole wall.
Column stores achieve 5-20x compression over row stores because values in a single column share a data type and are often similar. Compression techniques include dictionary encoding (replacing repeated strings with small integer IDs), run-length encoding (storing 'USA' repeated 1,000,000 times as a single value-count pair), delta encoding (storing differences between consecutive timestamps instead of full values), and bit-packing. Column stores also enable vectorized execution: processing a batch of values from one column with a single SIMD instruction instead of one value at a time. The downside: column stores are terrible at point lookups and single-row writes, because updating one row means touching every column file. This is why OLTP systems use row stores and OLAP systems use column stores.
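Three of these encodings are easy to demonstrate in pure Python. This is an illustrative sketch on a tiny hypothetical column; real engines apply the encodings per column chunk and choose among them automatically.

```python
from itertools import groupby

column = ["USA"] * 5 + ["DE"] * 3 + ["USA"] * 2  # low-cardinality strings

# Dictionary encoding: repeated strings become small integer IDs.
dictionary = {}
codes = []
for v in column:
    codes.append(dictionary.setdefault(v, len(dictionary)))
# dictionary == {"USA": 0, "DE": 1}; codes hold ints, far smaller than strings

# Run-length encoding on the codes: consecutive repeats collapse
# into (value, run_length) pairs.
rle = [(code, len(list(run))) for code, run in groupby(codes)]
assert rle == [(0, 5), (1, 3), (0, 2)]  # 10 values -> 3 pairs

# Decoding round-trips losslessly.
inverse = {code: value for value, code in dictionary.items()}
decoded = [inverse[code] for code, n in rle for _ in range(n)]
assert decoded == column

# Delta encoding: sorted timestamps become a base plus small differences.
ts = [1000, 1005, 1007, 1012]
deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
assert deltas == [1000, 5, 2, 5]  # small deltas then bit-pack tightly
```

Note how the encodings compose: dictionary encoding first, then run-length encoding over the resulting integer codes, which is exactly where sorted, low-cardinality columns compress best.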
Modern column stores use a hybrid approach. ClickHouse uses sparse primary indexes (one entry per ~8192 rows) rather than indexing every row, which keeps the index small enough to fit in memory even for trillion-row tables. Parquet files (used by Snowflake, BigQuery, Spark) organize data into row groups, and within each row group, data is stored columnar — this gives the benefits of columnar compression while still allowing row-group-level predicate pushdown. DuckDB brings columnar analytics into a single-process embedded database, making it possible to run OLAP queries on a laptop without a server. The PAX (Partition Attributes Across) storage model in some databases stores data columnar within each page but row-oriented across pages, attempting to get benefits of both.
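A sparse primary index in this style can be sketched in a few lines. The granule size and data here are toy values (ClickHouse's default granularity is 8192 rows): one index "mark" per granule keeps the index tiny, and a lookup binary-searches the marks, then scans only the matching granule.

```python
import bisect

GRANULE = 4  # rows per index mark (tiny for illustration)

# Data sorted by primary key, as in a sorted columnar part.
keys = list(range(0, 40, 2))  # 20 sorted keys: 0, 2, ..., 38

# Sparse index: only the first key of each granule is indexed.
marks = [keys[i] for i in range(0, len(keys), GRANULE)]  # 5 marks for 20 rows

def lookup(key):
    # Binary-search the marks for the granule that could contain the key...
    g = bisect.bisect_right(marks, key) - 1
    if g < 0:
        return None
    # ...then scan just that granule's rows; the rest are never read.
    start = g * GRANULE
    granule = keys[start:start + GRANULE]
    return key if key in granule else None

assert lookup(10) == 10
assert lookup(11) is None
```

The trade is deliberate: the index is a factor of GRANULE smaller than a dense B-tree, at the cost of scanning one granule per lookup, which is cheap for analytics and is why the index fits in memory even for trillion-row tables.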
Row stores keep all columns of a row together, optimizing for point lookups and transactional writes. Column stores keep all values of a column together, optimizing for analytical scans and aggregations. The key advantages of column stores are compression (same-type values compress 5-20x better) and vectorized execution, where the CPU processes batches of column values with SIMD instructions. In system design, I use row stores (PostgreSQL) for the application layer and column stores (ClickHouse, BigQuery) for analytics. The rule is simple: if your query touches specific rows, use a row store; if it scans millions of rows across a few columns, use a column store.
A common mistake is thinking column stores are universally faster. For a query like 'SELECT * FROM users WHERE id = 123', a row store with a B-tree index returns the answer in microseconds. A column store would need to read every column file and reassemble the row, making it orders of magnitude slower for point lookups.
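The asymmetry can be modeled directly. In this sketch (hypothetical data; the dict stands in for a B-tree or hash index), the row store answers a point lookup with one indexed access, while the columnar layout must locate the row's position and then gather a value from every column.

```python
# Row store: an index maps id -> the whole row, stored together.
row_store = {
    123: {"id": 123, "name": "Ada", "email": "ada@example.com"},
    456: {"id": 456, "name": "Lin", "email": "lin@example.com"},
}

# Column store: one array per column; row 123 lives at some position i.
col_store = {
    "id": [123, 456],
    "name": ["Ada", "Lin"],
    "email": ["ada@example.com", "lin@example.com"],
}

# SELECT * FROM users WHERE id = 123
fast = row_store[123]  # one indexed lookup, one location on disk

i = col_store["id"].index(123)  # find the row's position first...
slow = {col: vals[i] for col, vals in col_store.items()}  # ...then touch every column

assert fast == slow  # same answer; very different amount of work
```

In a real system the gap is wider still: each of those per-column reads is a separate file or chunk on disk, so the columnar point lookup pays one I/O per column instead of one per row.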