Analytical Engines

BigQuery

BigQuery is Google Cloud's serverless data warehouse. There is no infrastructure to manage — no clusters to size, no indexes to create, no vacuuming. You write SQL; BigQuery allocates resources, runs the query, and charges you for data scanned (on-demand) or reserved compute slots (capacity-based). It uses the Dremel columnar engine, which can scan petabytes of data in seconds by distributing work across thousands of nodes.

Memory anchor

BigQuery is an all-you-can-eat buffet that charges by the plate weight: zero kitchen staff to manage (serverless), but SELECT * loads your plate with the entire buffet and costs $60. Pick only the columns you need. Partitioning is putting foods in separate rooms so you only enter the one you want.

Expected depth

Pricing models: on-demand charges per TB scanned ($6.25/TB as of late 2024 — always verify current pricing), which makes minimizing data scanned critical. The main levers and features:

- Partitioning (by date, integer range, or ingestion time) limits scans to the relevant partitions.
- Clustering sorts data within partitions by up to 4 columns for efficient block pruning.
- Flat-rate pricing (since superseded by capacity-based editions) reserves compute slots (starting at 100) for predictable costs.
- Materialized views refresh automatically and are used transparently by the query optimizer.
- BI Engine is an in-memory acceleration layer for sub-second dashboard queries on datasets under roughly 200GB.
- Streaming inserts provide real-time ingestion at higher cost than batch loads, which are free; the Storage Write API is the cheaper modern path.
- BigQuery ML lets you train ML models directly in SQL — useful for quick experimentation, but not production ML.
- External tables can query data in GCS, Drive, or Bigtable without loading it.
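The pricing arithmetic above can be sketched as a small cost model. This is a hedged illustration, not a BigQuery API: the $6.25/TB rate, table size, and daily-partition layout are assumptions for the example.

```python
TB = 1024 ** 4  # bytes per tebibyte (BigQuery bills per TiB scanned)
ON_DEMAND_PER_TB = 6.25  # assumed on-demand USD rate; always verify current pricing


def on_demand_cost(bytes_scanned: float) -> float:
    """On-demand cost in USD for a given number of bytes scanned."""
    return bytes_scanned / TB * ON_DEMAND_PER_TB


# Hypothetical 3.65 TB table partitioned by day over one year: ~10 GB/partition.
table_bytes = 3.65 * TB
partition_bytes = table_bytes / 365

full_scan = on_demand_cost(table_bytes)            # no partition filter
pruned_scan = on_demand_cost(7 * partition_bytes)  # WHERE clause hits 7 partitions

print(f"full scan:  ${full_scan:.2f}")
print(f"7-day scan: ${pruned_scan:.4f}")
```

The point of the sketch: a partition filter changes what you pay by orders of magnitude, which is why `require_partition_filter` exists.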

Deep — senior internals

BigQuery execution internals: Dremel decomposes a SQL query into a tree of execution stages, distributes leaf stages across thousands of workers, and aggregates results up the tree. Key points:

- Shuffle (data redistribution between stages) is the most expensive operation — queries that shuffle large amounts of data (DISTINCT on high-cardinality columns, large JOINs without co-partitioning) are slow and expensive.
- Slot-based execution means query performance depends on how many slots are allocated — during peak times on shared on-demand pools, queries may be slower.
- Partitioned tables with require_partition_filter = true force users to include a partition filter, preventing accidental full-table scans.
- INFORMATION_SCHEMA views provide query history, job metadata, and slot utilization — essential for cost monitoring.
- BigQuery Omni extends BigQuery to AWS and Azure data without moving it.
- Cost trap: SELECT * on a 10TB table costs $62.50 in on-demand mode — always preview with SELECT specific_columns and use partitioning.
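The tree-aggregation idea can be shown in miniature: leaf workers reduce their shard to a small partial result, and only those partials travel up the tree. That is why a grouped COUNT is cheap, while a high-cardinality DISTINCT (whose "partial result" is the entire value set) forces an expensive shuffle. A toy Python sketch of the two-stage tree, not actual BigQuery code:

```python
from collections import Counter


def leaf_stage(shard: list[str]) -> Counter:
    """Each worker aggregates its own shard locally; the output is small."""
    return Counter(shard)


def root_stage(partials: list[Counter]) -> Counter:
    """The root merges partial aggregates — it never sees raw rows."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical rows sharded across three workers.
shards = [["us", "eu", "us"], ["eu", "eu", "apac"], ["us"]]
result = root_stage([leaf_stage(s) for s in shards])
print(result)  # per-region counts merged across all shards
```

With a low-cardinality GROUP BY key, each partial is a few entries regardless of shard size; with millions of distinct keys, the partials approach the size of the input, and the merge step becomes the shuffle bottleneck described above.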

🎤Interview-ready answer

BigQuery is my choice when I need serverless analytics, especially on GCP. Zero infrastructure management is the killer feature — no clusters to size, no indexes to maintain, no vacuuming. I optimize costs through three mechanisms: partitioning (by timestamp for time-series data), clustering (sorts data within partitions for efficient pruning), and selecting only needed columns (columnar storage means selecting 3 columns out of 100 reads roughly 3% of the data). For predictable workloads, I use capacity-based slot reservations instead of on-demand per-TB pricing. BI Engine provides sub-second, in-memory acceleration for dashboard queries. The main trade-off versus Snowflake is ecosystem: BigQuery is deeply integrated with GCP (Dataflow, Pub/Sub, Vertex AI), while Snowflake is cloud-agnostic. Versus ClickHouse, BigQuery trades raw query speed for zero operational overhead.

Common trap

Running SELECT * queries on large BigQuery tables with on-demand pricing. BigQuery charges per byte scanned, and columnar storage means selecting specific columns reads dramatically less data. A SELECT * on a 10TB table costs $62.50 per query at the $6.25/TB on-demand rate. Always select only the columns you need and partition your tables.
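The trap is easy to make concrete with back-of-the-envelope arithmetic. The sketch below assumes equal-width columns (a simplification — real column widths vary) and the $6.25/TB on-demand rate:

```python
RATE_PER_TB = 6.25  # assumed on-demand USD per TiB scanned; verify current pricing


def query_cost(table_tb: float, cols_selected: int, cols_total: int) -> float:
    """Columnar storage: cost scales with the fraction of columns read
    (assuming equal-width columns, a deliberate simplification)."""
    return table_tb * (cols_selected / cols_total) * RATE_PER_TB


select_star = query_cost(10, 100, 100)   # SELECT * on a 10 TB, 100-column table
select_three = query_cost(10, 3, 100)    # SELECT only 3 of those 100 columns

print(f"SELECT *:      ${select_star:.2f}")
print(f"SELECT 3 cols: ${select_three:.3f}")
```

Same table, same rows — roughly a 33x cost difference purely from the column list.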

Related concepts