What is Elasticsearch in Databases?

Time-Series & Search (high)

Elasticsearch

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It uses an inverted index for fast full-text search: given a search term, it looks up the list of documents containing that term directly instead of scanning every document. Elasticsearch excels at full-text search (product search, site search), log aggregation (the ELK stack), and any workload that needs relevance-ranked text search.
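To make the inverted index concrete, here is a minimal in-memory sketch in plain Python. It is a toy, not how Lucene stores its index: the whitespace tokenizer and the function names (`build_inverted_index`, `search_any`, `search_all`) are assumptions made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does much more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_any(index, query):
    """IDs of documents containing at least one query term (OR)."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


def search_all(index, query):
    """IDs of documents containing every query term (AND)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Blue running shoes",
    2: "Red running jacket",
    3: "Blue suede shoes",
}
index = build_inverted_index(docs)
print(search_all(index, "blue shoes"))  # documents 1 and 3
print(search_any(index, "running"))     # documents 1 and 2
```

Each lookup touches only the postings for the query terms, which is why search cost depends on how many documents match rather than on how many documents exist.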

Memory anchor

Elasticsearch is the index at the back of every book in a library: you say 'blue shoes' and it instantly points to every book and page mentioning them (inverted index). But it is NOT the books themselves — never use the index as your only copy. If the index gets torn, you rebuild it from the actual books (PostgreSQL).

Expected depth

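The inverted index maps every unique term to the list of documents containing it (the postings list). When you search for 'blue running shoes,' Elasticsearch looks up 'blue,' 'running,' and 'shoes' in the index and combines the postings lists: by default a match query returns documents containing any of the terms (OR), and setting the operator to AND requires all of them. BM25 scoring ranks the matches by relevance using term frequency, inverse document frequency, and field length. Analyzers control text processing: tokenization (splitting text into terms), lowercasing, stemming (running -> run), and stop word removal. The default standard analyzer only tokenizes and lowercases; stemming and stop word removal must be configured. Mappings define field data types: text fields are analyzed for full-text search, while keyword fields are indexed as a single unanalyzed term for exact matching, sorting, and aggregations. The number of primary shards (number_of_shards) is fixed at index creation; changing it means reindexing or using the split/shrink APIs. number_of_replicas can be changed at any time. An Elasticsearch cluster distributes shards across nodes and never places a replica on the same node as its primary, which provides high availability.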
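The BM25 ranking described above can be sketched with the classic formula, using the common defaults k1 = 1.2 and b = 0.75. This is an illustrative reimplementation over a toy corpus, not Lucene's code: the `analyze` helper and `bm25_scores` function are hypothetical names, and recent Lucene versions use a slightly rescaled variant that produces the same ordering.

```python
import math
from collections import Counter


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every matching document against the query with classic BM25."""
    tokens = {doc_id: analyze(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    scores = {}
    for doc_id, terms in tokens.items():
        term_freq = Counter(terms)
        score = 0.0
        for term in analyze(query):
            freq = term_freq[term]
            if freq == 0:
                continue
            # Document frequency: how many documents contain the term.
            doc_freq = sum(1 for t in tokens.values() if term in t)
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            # Long fields are penalized relative to the average length.
            length_norm = 1 - b + b * len(terms) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores


docs = {
    1: "blue running shoes",
    2: "red running jacket",
    3: "blue suede shoes for running and walking",
}
scores = bm25_scores(docs, "blue shoes")
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # document 1 ranks above document 3
```

Documents 1 and 3 both contain each query term once, but document 1 is shorter than average, so the field-length normalization ranks it first. Document 2 contains neither term and receives no score.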

Deep — senior internals

Elasticsearch operational concerns: shard sizing (target 10-50GB per shard, avoid thousands of tiny shards), JVM heap (at most 50% of RAM, and below roughly 31GB so the JVM keeps using compressed oops), and the split-brain problem (run at least 3 master-eligible nodes so a quorum survives the loss of one; versions before 7.0 also required setting discovery.zen.minimum_master_nodes to the quorum, while 7.0 and later manage the voting configuration automatically and no longer use that setting). Hot-warm-cold architecture uses index lifecycle management (ILM) to move older indices to cheaper storage tiers. Elasticsearch is NOT a good primary database: search is near real-time rather than immediately consistent (a newly indexed document becomes searchable only after the next refresh, controlled by refresh_interval, default 1 second), it does not support multi-document transactions, and data can be lost during split-brain scenarios, a risk that is highest on older or misconfigured clusters. Always use it as a secondary index alongside a primary database (PostgreSQL as source of truth, Elasticsearch for search). OpenSearch is the fork AWS started in 2021 after Elastic changed its license; it is functionally similar for most use cases.
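The sizing rules of thumb above reduce to simple arithmetic. The helpers below are a hypothetical sketch that encodes them; the function names and the 30GB default target (picked from the middle of the 10-50GB range) are assumptions, and real capacity planning should be validated against the actual workload.

```python
import math


def recommended_heap_gb(ram_gb, ceiling_gb=31):
    """Half of RAM, capped at the compressed-oops ceiling from the text."""
    return min(ram_gb / 2, ceiling_gb)


def primary_shards_for(index_size_gb, target_shard_gb=30):
    """Number of primary shards so each lands near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def master_quorum(master_eligible_nodes):
    """Majority of master-eligible nodes needed to elect a master."""
    return master_eligible_nodes // 2 + 1


print(recommended_heap_gb(64))   # 31
print(primary_shards_for(600))   # 20
print(master_quorum(3))          # 2
```

With 3 master-eligible nodes the quorum is 2, so the cluster can lose one node and still elect a master, while an isolated single node cannot form a competing cluster.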

🎤 Interview-ready answer

I use Elasticsearch for two primary use cases: full-text search and log aggregation. For search, the inverted index with BM25 scoring provides relevance-ranked results that are hard to match with a relational database's built-in text search. For log aggregation, the ELK stack (Elasticsearch, Logstash, Kibana) handles high-volume ingestion and fast searching across billions of log entries; Loki is a lighter modern alternative for plain log storage, but I still choose Elasticsearch for structured log analytics. Critically, I never use Elasticsearch as a primary database. It is an inverted index optimized for search, not a system of record. It is eventually consistent for search (default 1-second refresh interval), does not support transactions, and can lose data during split-brain. My architecture is always: primary database (PostgreSQL) -> CDC or event pipeline -> Elasticsearch index. If the index gets corrupted, I rebuild it from the source of truth.
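The "primary database -> CDC pipeline -> derived index" shape can be sketched with in-memory stand-ins. Everything here is hypothetical: `SearchIndex` is a plain dict wrapper rather than an Elasticsearch client, the event format (`op`, `id`, `row`) is invented for illustration, and a real pipeline would also handle ordering, retries, and batching.

```python
class SearchIndex:
    """In-memory stand-in for a derived search index."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, doc):
        self.docs[doc_id] = dict(doc)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def apply_change(index, event):
    """Apply one change event from the primary database to the index."""
    if event["op"] == "delete":
        index.delete(event["id"])
    else:  # inserts and updates are both idempotent upserts
        index.upsert(event["id"], event["row"])


def rebuild(primary_rows):
    """Recreate the index from scratch using only the source of truth."""
    index = SearchIndex()
    for doc_id, row in primary_rows.items():
        index.upsert(doc_id, row)
    return index


# The primary store is authoritative; the index only mirrors it.
primary = {1: {"name": "blue shoes"}, 2: {"name": "red jacket"}}
index = rebuild(primary)

# A later change flows through the pipeline as an event.
primary[3] = {"name": "green hat"}
apply_change(index, {"op": "upsert", "id": 3, "row": primary[3]})
del primary[2]
apply_change(index, {"op": "delete", "id": 2})

# If the index is lost or corrupted, a rebuild restores the same state.
restored = rebuild(primary)
print(restored.docs == index.docs)  # True
```

Because every index write is derived from a primary row, losing the index costs a rebuild rather than data.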

⚠ Common trap

Using Elasticsearch as your primary database. Elasticsearch is optimized for search, not for being a system of record. It is eventually consistent, does not support transactions, and data loss during split-brain scenarios is a real risk. Always maintain a separate source of truth (PostgreSQL, DynamoDB) and treat Elasticsearch as a derived index that can be rebuilt.
