Aggregation & GROUP BY
GROUP BY collapses rows sharing the same value into groups, applying aggregate functions (COUNT, SUM, AVG, MAX, MIN) per group. HAVING filters groups after aggregation, like WHERE filters rows before.
GROUP BY is a spreadsheet pivot table — it collapses matching rows into a single row per bucket. HAVING is the filter you apply AFTER the pivot; WHERE is the filter before the pivot.
Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. You cannot use column aliases from SELECT in WHERE or HAVING (they're evaluated before SELECT). DISTINCT eliminates duplicate rows from the result, not from aggregation. COUNT(*) counts all rows including NULLs; COUNT(column) ignores NULLs. GROUPING SETS, ROLLUP, and CUBE generate multiple grouping levels in one query — ROLLUP(year, month) produces (year, month), (year), and () totals.
GROUP BY can reference expression results or positional references (GROUP BY 1, 2). PostgreSQL allows GROUP BY on columns not in SELECT (unlike some stricter SQL modes). The optimizer may use a hash aggregate (build hash table of groups, spills to disk if hash exceeds work_mem) or a sort aggregate (sort the data, then scan for group boundaries). For large aggregations, consider partial aggregation in parallel workers. FILTER clause on aggregates: COUNT(*) FILTER (WHERE status = 'active') — conditional aggregation without CASE inside the aggregate.
GROUP BY groups rows for aggregate functions. The key rule is that every non-aggregated column in SELECT must appear in GROUP BY. I use HAVING for group-level filtering, WHERE for row-level filtering before grouping. COUNT(*) includes NULLs; COUNT(column) doesn't. ROLLUP generates subtotals in one query — I use it for reports needing totals at multiple levels.
Using WHERE to filter aggregate results. WHERE filters before GROUP BY — aggregates don't exist yet. HAVING filters after GROUP BY, where aggregates exist. SELECT department, AVG(salary) WHERE AVG(salary) > 50000 is invalid; use HAVING AVG(salary) > 50000.