LSM Basic
The code is in src/storage/lsm
. The architecture of the LSM-tree in Wing is as follows.
DBImpl
: At the highest level, the DBImpl
is the primary component responsible for interacting with users. It references the most recent SuperVersion
of the database. It is in lsm.hpp
.
SuperVersion
: It includes a MemTable, a list of immutable MemTables and the on-disk LSM-tree Version
, which together represent the current state of the database. It is in version.hpp
.
MemTable
: An in-memory ordered data structure. It is flushed to disk when it reaches its capacity. It is in memtable.hpp
.
Version
: An array of levels, representing the on-disk LSM-tree. It is in version.hpp
.
Level
: It is composed of one or more sorted runs. It is in level.hpp
.
SortedRun
: It can be viewed as a sorted key-value array which is divided into several SSTables. It is in level.hpp
.
SSTable
: It is composed of data blocks, an index, bloom filter and metadata. It is in sst.hpp
.
Block
: It stores records. It is in block.hpp
.
Record Format
Records in the database are structured as (key, seq, type, value)
tuple, where seq
denotes the timestamp (or sequence number) and type
denotes the record type. A record with type=RecordType::Value
represents the key-value pair at the timestamp seq
. When type
equals to RecordType::Deletion
, the record indicates that the key has been marked for deletion at the timestamp seq
.
(key, seq, type)
is called the internal key of record (see storage/lsm/format.hpp
) because seq
and type
are invisible to users. We can define a comparing function to sort them: for two internal keys (key0, seq0, type0)
and (key1, seq1, type1)
, the former is smaller than the latter if and only if key0 < key1
or key0 == key1 && seq0 > seq1
. With this comparing function, for (key, seq)
, the newest record with the same key and a sequence number seq0 <= seq
is the first record (key0, seq0)
with (key, seq) <= (key0, seq0)
.
Get
DBImpl::Get(key, seq, &value_str)
function is designed to retrieve records from the database. It looks for the first record (key0, seq0, type0)
satisfying key0 == key && seq0 <= seq
. If the record is a record with type RecordType::Deletion
or no such record exists, the function returns false, indicating the requested value is not found. If a record with type RecordType::Value
is found, the function returns true and copies the associated value into value_str
.
The process of Get
is:
-
Check if MemTable has the record. If not, check if immutable MemTables have the record. It checks from the newest immutable MemTable to the oldest immutable MemTable. It stops once it finds the record. If they do not have the record, proceed to check the on-disk LSM-tree.
-
Check if Level 0 has the record, then Level 1, 2, and so on. For each level, it performs a binary search to find the SSTable that possibly has the record. Then it queries the bloom filter. If further inquiry is necessary, it performs a binary search on the SSTable's index to find the data block. It reads the data block from the disk, and performs a binary search to find the record.
Put and Delete
DBImpl::Put(key, seq)
creates a new record (key, seq, RecordType::Value, value)
and DBImpl::Del(key, seq)
creates a new record (key, seq, RecordType::Deletion)
. Once a record is created, it is inserted to the MemTable. For convenience, a lock is employed to ensure that only a single writer can perform operations at any given time.
If the MemTable reaches its capacity, it creates a new superversion and moves the MemTable to the immutable MemTable list and create a new MemTable.
The database has 2 threads to persist the data to disk: the flush thread and the compaction thread. The flush thread is awakened whenever a new immutable MemTable is created. It flushes the immutable MemTables to the first level (Level 0) of the LSM-tree. Every time an immutable Memtable is flushed, the compaction thread is awakened. It acquires new compaction tasks through CompactionPicker::Get
. Once it has tasks, it proceeds to execute them and subsequently updates the superversion.
Scan
The database supports range scans. DBImpl::Begin()
returns an iterator positioned to the beginning of the data, while DBImpl::Seek(key, seq)
returns an iterator positioned to the first record (key0, seq0, type0)
satisfying (key, seq) <= (key0, seq0)
.
Scan operations are performed in a snapshot. When a DBIterator
is created, it stores the current sequence number. It can only see the records with sequence number smaller than the stored sequence number.
The architecture of iterators is as follows. BlockIterator
is the iterator on data blocks. SSTableIterator
is the iterator on SSTables and contains a BlockIterator
. SortedRunIterator
is the iterator on sorted runs and contains a SSTableIterator
. SuperVersionIterator
is the iterator on superversions, it contains all the SortedRunIterator
s and MemTableIterator
s using IteratorHeap
, which maintains the record with the minimum internal key by maintaining iterators in a heap. There is no LevelIterator
or VersionIterator
because it is inefficient to maintain two IteratorHeap
s. The DBIterator
operates at the highest level, merging records with the same key and skipping the keys which are marked deleted.
The interfaces of Iterator
can be found in storage/lsm/iterator.hpp
. Here is an example of usage:
DBIterator it = ...; // create the iterator
while (it.Valid()) {
// Read key and value of the current entry
auto key = it.key();
auto value = it.value();
// output key and value
DB_INFO("{}, {}", key, value);
// Move to the next entry
it.Next();
// After moving, key and value may be invalid because they are `Slice`s,
// i.e. references to a internal buffer in `it`.
}