LSM Tree

What is an LSM Tree?

The Log-Structured Merge-Tree (LSM Tree) is a data structure optimized for systems with high write volumes (like many NoSQL databases, search engines, and data warehouses). Unlike traditional B-Trees which often perform in-place updates (requiring potentially slow random disk I/O), LSM Trees buffer writes in memory and write data to disk sequentially in immutable batches.

Core Idea: Trading Read Performance for Write Performance

LSM Trees make writes very fast by absorbing them into an in-memory structure (Memtable) and appending to a log (Write-Ahead Log - WAL, not explicitly visualized) for durability. Reads, however, might need to check multiple places (memory and several disk files) to find the latest value.

Components

Memtable: An in-memory, sorted data structure (like a balanced tree or skip list). All new writes (inserts, updates, deletes) go here first. Provides very fast writes and reads for recent data.
Immutable Memtable: When the active Memtable reaches a certain size, it becomes immutable (read-only) and a new active Memtable is created. The immutable one is scheduled to be flushed to disk.
Write-Ahead Log (WAL): (Optional, for durability, *not explicitly visualized here*) Before writing to the Memtable, the operation is appended sequentially to a log file on disk. If the system crashes, the WAL is used to rebuild the Memtable on restart, preventing data loss for writes that weren't yet flushed to SSTables.
SSTables (Sorted String Tables): Immutable files on disk containing sorted key-value pairs flushed from Memtables. They are typically organized into Levels (L0, L1, L2, ...).
- Level 0 (L0): Usually contains SSTables flushed directly from Memtables. Files in L0 might have overlapping key ranges.
- Level 1+ (L1, L2...): Contain SSTables resulting from Compaction. Files within these levels typically have non-overlapping key ranges. Each level is often significantly larger than the previous one (e.g., 10x).

Operations

Write / Update:
1. (Optional) Write operation to WAL.
2. Insert/Update key-value pair in the active Memtable. If the key exists, the new value overwrites the old one (conceptually, based on timestamp).
3. If Memtable size exceeds threshold, trigger a Flush.
Delete:
1. (Optional) Write delete operation to WAL.
2. Insert a special marker (tombstone) for the key in the Memtable, indicating deletion.
Flush:
1. Make the current Memtable immutable.
2. Create a new, empty active Memtable.
3. Write the sorted contents of the Immutable Memtable to a new SSTable file on disk, usually in Level 0 (L0).
4. Once the SSTable is written, the Immutable Memtable (and corresponding WAL entries) can be discarded.
Compaction: A background process crucial for performance and space management.
1. Typically triggered when a level reaches a size/file count threshold (e.g., L0 has too many files).
2. Selects one or more SSTables from a level (e.g., Level `i`).
3. Selects overlapping SSTables from the next level (Level `i+1`).
4. Merges the sorted data from all selected SSTables, keeping only the newest version of each key and discarding deleted (tombstoned) or overwritten data.
5. Writes the merged, sorted result into new SSTable(s) in Level `i+1`.
6. Once the new SSTables are written, the old input SSTables from Level `i` and Level `i+1` are deleted.
Read (Get): To find the value for a key:
1. Check the active Memtable. If found (and not a tombstone), return the value.
2. Check the Immutable Memtable (if one exists). If found (and not a tombstone), return value.
3. Check SSTables in Level 0 (L0), usually from newest to oldest file. If found, return value (if not a tombstone).
4. Check SSTables in Level 1 (L1). Since keys don't overlap within L1, only one file needs checking per key range. If found, return value (if not a tombstone).
5. Continue checking subsequent levels (L2, L3...) until the key is found or all levels are exhausted.
6. If a tombstone is found for the key before a value, the key is considered deleted, return "not found".
7. If the key is not found anywhere, return "not found".

This read process potentially checking multiple places is called read amplification. Compaction helps reduce this by merging data and pushing it to higher levels.

Log-Structured Merge-Tree (LSM Tree)

What is an LSM Tree?

Core Idea: Trading Read Performance for Write Performance

Components

Operations

Visualize the LSM Tree

Memtable (In-Memory, Sorted)

Immutable Memtable (Flushing...)

Disk Storage (SSTable Levels)