An Inverted Index is the core data structure powering most modern search engines (like Google, Elasticsearch, Lucene). Instead of mapping documents to the words they contain (like a book's index), it maps terms (usually words) to the documents (or locations within documents) where they appear.
Think of the index at the back of a book: it lists keywords and the page numbers where they can be found. An Inverted Index does the same, but often on a much larger scale, mapping words to Document IDs.
Why Use It?
Fast Full-Text Search: Finding all documents containing a specific term is extremely fast. Instead of scanning every document, you just look up the term in the index and get the list of relevant documents directly.
Efficient Querying: Supports complex queries involving multiple terms (AND, OR, NOT) by performing efficient operations (like intersection or union) on the document lists retrieved for each term.
Scalability: The index can be partitioned and distributed across multiple machines.
How It's Built and Used
Document Input: Documents (web pages, articles, product descriptions, etc.) are fed into the indexing process.
Tokenization & Normalization: Each document's text is broken down into individual tokens (usually words). These tokens are then normalized: converted to lowercase, punctuation removed, and common "stop words" (like "the", "a", "is") often discarded as they don't carry much search value. Sometimes stemming (reducing words to their root form, e.g., "running" -> "run") is also applied.
Index Construction: The core step. A data structure (like a hash map or dictionary) is built where:
The Keys are the normalized terms (words).
The Values are lists (called posting lists) containing the IDs of the documents where that term appears. Often, extra information like term frequency or positions within the document is also stored (not fully visualized here).
Example Index Entry: "search" -> [Doc1, Doc5, Doc42]
Querying:
A user query (e.g., "fast search algorithm") is tokenized and normalized using the same process used during indexing. (Result: `["fast", "search", "algorithm"]`).
For each term in the query, the system looks up its posting list in the inverted index.
`fast` -> `[Doc2, Doc5]`
`search` -> `[Doc1, Doc5, Doc42]`
`algorithm` -> `[Doc5, Doc99]`
The posting lists are combined based on the query logic (e.g., AND, OR). For an AND query ("fast" AND "search" AND "algorithm"), the system finds the intersection of the lists: `[Doc5]`.
For an OR query, it finds the union: `[Doc1, Doc2, Doc5, Doc42, Doc99]`.
The resulting document IDs are typically ranked based on relevance (using algorithms like TF-IDF or BM25, not visualized here) and presented to the user.
Visualize the Indexing and Search
Add documents to build the index, then search for terms to see the lookup process.