Document & NoSQL

MongoDB

MongoDB is a document database that stores data as JSON-like documents (BSON). Each document can have a different structure, making it ideal for polymorphic data (e.g., a products collection where different product types have different attributes). It supports rich querying, secondary indexes, and aggregation pipelines. MongoDB excels when you need schema flexibility and rapid iteration, and when your data access patterns align with document-level operations.
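A quick sketch of that polymorphic products collection, using plain Python dicts and a toy `find()` stand-in (the product fields here are invented for illustration, not a real schema):

```python
# Three documents in one hypothetical "products" collection. Each has a
# different shape - a relational table would force these into sparse columns.
products = [
    {"_id": 1, "type": "laptop", "name": "X1", "cpu": "i7", "ram_gb": 16},
    {"_id": 2, "type": "tshirt", "name": "Logo Tee", "size": "M", "color": "black"},
    {"_id": 3, "type": "coffee", "name": "House Blend", "roast": "dark", "weight_g": 250},
]

def find(collection, query):
    """Tiny stand-in for MongoDB's find(): match documents on exact field values."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in query.items())]

# Query on a field that only some documents have - allowed, as in MongoDB.
laptops = find(products, {"type": "laptop"})
```

Each document answers only for the fields it actually has; there is no schema forcing `cpu` onto t-shirts.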

Memory anchor

MongoDB is a junk drawer: toss anything in, every item can be shaped differently (documents). Great when your kitchen gadgets are genuinely all different shapes. Terrible when you realize you need to find all the forks (joins) — you end up dumping the whole drawer on the floor.

Expected depth

MongoDB uses the WiredTiger storage engine (default since 3.2) with document-level locking, compression, and in-memory caching. Sharding is horizontal scaling by distributing data across shards based on a shard key — choosing the right shard key is the most critical MongoDB design decision. A bad shard key (like a monotonically increasing timestamp) creates hot shards. Good shard keys have high cardinality, distribute writes evenly, and support your query patterns. Replica sets provide high availability with automatic failover: one primary handles all writes, while secondaries replicate asynchronously and can serve reads (via read preference) with eventual consistency. The aggregation pipeline ($match, $group, $lookup, $unwind) is powerful, but $lookup (joins) performs poorly at scale because MongoDB is not designed for joins. Multi-document transactions (since 4.0) work but add latency and should be used sparingly — if you need transactions everywhere, use PostgreSQL.
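The hot-shard effect can be simulated without a cluster. The sketch below is a toy model (not MongoDB's actual chunk-splitting logic) that assigns monotonically increasing keys to four shards, once by range and once by hash:

```python
import hashlib
from collections import Counter

def hashed_shard_for(key, num_shards):
    """Hashed sharding: a stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shard_for(key, num_shards, key_space=10_000):
    """Range sharding: contiguous key ranges map to contiguous shards."""
    return min(key * num_shards // key_space, num_shards - 1)

# Monotonically increasing keys (e.g. timestamps): the most recent 1,000 keys.
recent_keys = range(9_000, 10_000)

# Under range sharding every new write lands on the last shard - a hot shard.
hot = Counter(range_shard_for(k, 4) for k in recent_keys)

# Under hashed sharding the same keys spread across all four shards.
spread = Counter(hashed_shard_for(k, 4) for k in recent_keys)
```

Range sharding concentrates all recent writes on one shard, which is exactly the hotspot a monotonic shard key creates; hashing trades that hotspot away, at the cost of efficient range queries on the key.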

Deep — senior internals

WiredTiger internals: uses a cache (default 50% of RAM minus 1GB) with write-ahead logging. Compresses data with snappy (default) or zlib/zstd. The oplog is a capped collection that records all write operations for replication — oplog size determines how far behind a secondary can fall before needing a full resync. Change Streams (built on the oplog) provide real-time notifications of data changes — useful for event-driven architectures. Schema validation (JSON Schema) can enforce structure while maintaining flexibility. Atlas Search integrates Lucene-based full-text search directly into the aggregation pipeline, reducing the need for a separate Elasticsearch instance. MongoDB limitations to know: no foreign keys, no real joins ($lookup performs a per-document lookup against the foreign collection and slows down at scale), a 100MB default memory limit per aggregation stage, and transactions were limited to a single 16MB oplog entry before 4.2 (newer versions split large transactions across multiple oplog entries, but the latency concerns remain).
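Schema validation is declared per collection as a $jsonSchema document. Below is a hedged sketch of what such a validator looks like, checked with a minimal hand-rolled matcher instead of a live server (the name/price fields are illustrative, not from any real schema):

```python
# Shaped like the $jsonSchema documents passed as a collection validator.
# Field names ("name", "price") are invented for this example.
validator = {
    "bsonType": "object",
    "required": ["name", "price"],
    "properties": {
        "name": {"bsonType": "string"},
        "price": {"bsonType": "double", "minimum": 0},
    },
}

def matches(doc, schema):
    """Minimal subset of JSON Schema: required fields and numeric minimums."""
    for field in schema.get("required", []):
        if field not in doc:
            return False
    for field, rule in schema.get("properties", {}).items():
        if field in doc and "minimum" in rule and doc[field] < rule["minimum"]:
            return False
    return True
```

This is the "structure while maintaining flexibility" point: required fields are enforced, but documents can still carry any extra fields the schema does not mention.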

🎤 Interview-ready answer

I choose MongoDB when I have polymorphic data that does not fit cleanly into a relational schema — for example, a product catalog where electronics, clothing, and food items have completely different attributes. The document model lets each document have its own structure while still supporting indexes and queries. However, I am deliberate about this choice. If my data has relationships that require joins, I use PostgreSQL. The most critical MongoDB design decision is the shard key — it must have high cardinality, distribute writes evenly, and align with query patterns. A bad shard key creates hot shards that cannot be rebalanced without downtime. I also design around the limitation of no efficient joins: I denormalize aggressively and accept some data duplication in exchange for single-document read performance.
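The denormalization trade-off in that answer can be made concrete: embed the fields a read path needs instead of referencing them. The customer/order shapes below are hypothetical:

```python
# Normalized (relational style): the order references the customer by id,
# so rendering an order page needs a second lookup - a "join".
customer = {"_id": 7, "name": "Ada", "email": "ada@example.com"}
order_ref = {"_id": 42, "customer_id": 7, "total": 99.5}

# Denormalized (document style): the subset of customer fields the order page
# needs is copied in. One read returns everything, but the duplicated name
# must be kept in sync if the customer document changes.
order_embedded = {
    "_id": 42,
    "customer": {"_id": 7, "name": "Ada"},  # duplicated subset of customer
    "total": 99.5,
}

def order_page(order, customers_by_id=None):
    """Gather the data for an order page: embedded needs no second lookup."""
    if "customer" in order:
        return order["customer"]["name"], order["total"]
    cust = customers_by_id[order["customer_id"]]
    return cust["name"], order["total"]
```

The embedded version pays with duplication and update complexity; the referenced version pays with an extra round trip on every read, which is the cost $lookup imposes at scale.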

Common trap

Choosing MongoDB as a default database because it is popular or because the schema is not yet defined. An undefined schema means you have not yet understood your data model — that is a design problem, not a reason to choose a schemaless database. PostgreSQL with JSONB columns gives you schema flexibility when you need it while maintaining relational integrity for everything else.

Related concepts