Terminology
acknowledge (a record). A record is acknowledged to the writer when it is durably stored, according to the replication policy for that log.
ALL_SEND_ALL. A fallback mode for the read path where each storage node sends all records. It's less efficient than single-copy delivery (SCD), but better for detecting gaps.
appender. An in-memory object that makes sure a record is durably written to a copyset.
authoritative status. The authoritative status of a shard indicates what role the shard is playing in the replication of data records.
bridge gap. On the read path, a bridge gap indicates that no records are stored between two LSNs that span two epochs. It indicates a benign gap, not a gap due to data loss.
bridge record. On the storage layer, a record that represents a bridge gap.
compaction. The process of combining multiple SST files into a single SST file to improve efficiency. Records that are older than their retention period may be removed during this process. Compaction occurs on a schedule. It may also be started by LogDevice as part of rebuilding.
copyset. The set of nodes on which a given record is actually stored.
dataloss gap. A gap between two sequence numbers that indicates that records were lost.
delta log. See snapshot.
disaggregated cluster. A cluster in which there are two types of hardware, where one type only handles the sequencing of records, and the other type stores records. These are referred to as sequencer and storage nodes, respectively.
donor. During rebuilding, the shard from which the record is being copied.
draining (of storage nodes). Storage nodes are drained when LogDevice intends to remove nodes from the cluster either temporarily (such as for maintenance) or indefinitely (for example, to shrink a cluster). No data is written to a node that is draining.
draining (of a sequencer). The sequencer stops accepting new appends but waits for all appenders in its sliding window to finish their replication and retire.
durability. Data has been stored, redundantly, so that it is not lost or compromised.
epoch. The upper 32 bits of the log sequence number (LSN).
ESN (epoch sequence number). The lower 32 bits of the log sequence number.
event log. An internal log that tracks shard authoritative states and transitions (such as rebuilding).
f-majority. A set of nodes that intersects every possible copyset. If a reader cannot find a particular record on an f-majority of nodes, then that record must be under-replicated or lost.
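For intuition: with a flat nodeset of n nodes, replication factor r, and no failure-domain constraints, any n - r + 1 nodes form an f-majority, since any smaller set could be disjoint from some copyset of size r. The sketch below illustrates that arithmetic only; it is not LogDevice's actual implementation, which also accounts for failure domains.

```cpp
#include <cstddef>
#include <iostream>

// Illustration only: for a flat nodeset with no failure-domain
// constraints, any (nodeset_size - replication_factor + 1) nodes
// intersect every possible copyset of size replication_factor.
std::size_t fmajoritySize(std::size_t nodeset_size,
                          std::size_t replication_factor) {
  return nodeset_size - replication_factor + 1;
}

int main() {
  // Example: nodeset of 5 nodes, replication factor 3.
  // Any 3 nodes form an f-majority: a reader that has checked 3 nodes
  // without finding a record knows it is under-replicated or lost.
  std::cout << fmajoritySize(5, 3) << "\n"; // prints 3
}
```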
findtime. A kind of query to a LogDevice cluster that, given a timestamp, returns an LSN that approximately corresponds to that time.
gap record. A special kind of record that indicates there is no data at this LSN. Types include bridge, dataloss, and hole plug.
gossip. The protocol that nodes use to exchange status information with each other and arrive at a consensus view of the cluster.
historical nodeset. When an epoch changes, the nodeset is permitted to change as well. Nodesets that belong to older epochs are called historical nodesets.
hole plug. A special record indicating that there is no record with this LSN. No record was acknowledged with that sequence number. It indicates a benign gap, not a gap due to data loss.
internal log. A log that a LogDevice cluster uses to maintain shared internal state, such as the event log or logsconfig.
L0 file. A file used by RocksDB to store records. It's created when the in-memory tables are flushed to disk. Each RocksDB partition has a separate set of L0 files.
log. A record-oriented, append-only, trimmable file. Once appended, a record is immutable. Each record is uniquely identified by a monotonically increasing sequence number. Readers read records sequentially from any point inside the retention period.
log ID. Every log is identified by a numeric ID. Each log has a corresponding metadata log, which has an ID of 2^63 + log_id.
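A quick sketch of that relationship (illustration only, not the LogDevice client API):

```cpp
#include <cstdint>

// The metadata log for a data log has ID 2^63 + log_id,
// i.e. the same ID with the top bit set.
constexpr uint64_t metadataLogId(uint64_t log_id) {
  return (uint64_t{1} << 63) + log_id;
}

static_assert(metadataLogId(1) == 0x8000000000000001ULL,
              "log 1 maps to metadata log 2^63 + 1");
```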
logsconfig. A pair of internal logs (and a replicated state machine that corresponds to them) that stores the configuration for logs.
LSN (log sequence number). A sequential ID used to order records in a log. The upper 32 bits are the epoch number, and the lower 32 are the offset within the epoch, or epoch sequence number (ESN). Normally, this number increases sequentially, but certain events, such as sequencer failover or changing replication settings, cause an "epoch bump", in which the epoch is incremented and the ESN is reset. Often formatted as eXXXnYYYY, where XXX is the epoch and YYYY is the ESN.
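A minimal sketch of how an LSN decomposes (hypothetical helper names, not LogDevice's actual code):

```cpp
#include <cstdint>
#include <cstdio>

// An LSN is a 64-bit value: upper 32 bits = epoch, lower 32 bits = ESN.
constexpr uint64_t makeLsn(uint32_t epoch, uint32_t esn) {
  return (uint64_t{epoch} << 32) | esn;
}
constexpr uint32_t epochOf(uint64_t lsn) {
  return static_cast<uint32_t>(lsn >> 32);
}
constexpr uint32_t esnOf(uint64_t lsn) {
  return static_cast<uint32_t>(lsn & 0xffffffffu);
}

int main() {
  uint64_t lsn = makeLsn(5, 42);
  // Human-readable form eXXXnYYYY: epoch 5, ESN 42 -> "e5n42".
  std::printf("e%un%u\n", epochOf(lsn), esnOf(lsn));
}
```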
metadata log. Every log has an associated metadata log that records things like historical nodesets and replication constraints.
mini-rebuilding. A rebuilding limited to a specified time range.
node. Synonym for server or host. Each node runs an instance of the LogDevice server.
nodeset. The set of disks that may have data for an epoch of a log. This is sometimes called a storage set.
non-authoritative rebuilding. If the number of shards that are down exceeds the replication factor, it may not be possible to fully re-replicate all of the data records. When this happens, rebuilding is best effort, or non-authoritative. If data is lost because some nodes remain unavailable, readers may stall; a manual operation is required to unstall the readers and accept the data loss.
partial compaction. When several small L0 files are compacted into a larger L0 file, leaving most L0 files alone. Partial compaction is triggered automatically when there is a size disparity between L0 files.
partition. The database is made up of partitions. Each partition corresponds to a time range and contains all the records with timestamps that fall into that range. The typical length of that range is about 15 minutes, and the data size in the partition is a few gigabytes. A new partition is created every few minutes to accommodate new data. Usually newly appended data is written to the latest partition, but all partitions are always writable. In particular, during rebuilding, many records are written to old partitions.
Reading works on two levels: at the higher level, the reader steps from partition to partition sequentially; at the lower level, it steps from record to record inside a partition until it reaches the end of that partition.
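A rough sketch of that two-level read loop (hypothetical, simplified types, not LogDevice's storage classes):

```cpp
#include <functional>
#include <vector>

// Hypothetical types for illustration only.
struct Record { /* log ID, LSN, copyset, timestamp, payload */ };
struct Partition {
  std::vector<Record> records; // records whose timestamps fall into this
                               // partition's time range
};

// Higher level: step from partition to partition.
// Lower level: step from record to record inside each partition.
void readAll(const std::vector<Partition>& partitions,
             const std::function<void(const Record&)>& deliver) {
  for (const Partition& partition : partitions) {
    for (const Record& record : partition.records) {
      deliver(record);
    }
  }
}
```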
rebuilding. The process of re-replicating records after a storage failure.
record. The smallest unit of data in a log. A record includes the log ID, LSN, copyset, timestamp, and the payload.
recovery. The process of guaranteeing sanity at the tail of an epoch after a sequencer failure.
release. A record is released to the reader when it is durably stored and there are no in-flight records with a smaller LSN.
rewind. If a client is reading in SCD mode and determines there may be a gap, it tells the storage nodes to rewind and fall back to ALL_SEND_ALL mode.
RSM (replicated state machine). A state machine that tails a pair of internal logs to build a state that is eventually consistent across all hosts in the cluster.
SCD (single-copy delivery). An optimization used in the read path. Only the primary node for the record reads and sends the record to the client.
sealing. The first stage of recovery, in which the old epoch is fenced off to prevent any new writes to it.
sequencer. An object responsible for assigning LSNs to incoming records. Can also refer to a node that can run sequencers.
shard. A storage unit in a node.
snapshot. Each internal log is really two logs: the delta log and the snapshot log. The delta stores recent changes, and every so often, the changes are accumulated into a snapshot and written to the snapshot log.
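A rough sketch of that pattern (hypothetical types; LogDevice's real replicated state machine is more involved): a reader restores the latest snapshot and then applies only the deltas with higher LSNs.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration of the snapshot + delta pattern.
struct Snapshot {
  uint64_t version;                        // LSN of the last delta included
  std::map<std::string, std::string> state;
};
struct Delta {
  uint64_t lsn;
  std::string key, value;
};

std::map<std::string, std::string>
rebuildState(const Snapshot& snap, const std::vector<Delta>& deltaLog) {
  auto state = snap.state;                 // start from the snapshot
  for (const Delta& d : deltaLog) {
    if (d.lsn > snap.version) {            // apply only newer deltas
      state[d.key] = d.value;
    }
  }
  return state;
}
```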
SST (Sorted Sequence Table) file. Same as L0 file in RocksDB.
storage set. The set of shards that contain all records in an epoch.
trim. Delete older records in a log. Trimming can be retention-based, space-based, or on demand.
trim point. The LSN up to which a log has been trimmed. The trim point is maintained by RocksDB. Records older than the trim point may not be deleted immediately, but they are not served to the client reader.
wave. If the appender cannot successfully store a record in a copyset, it picks a new one and tries again. Each attempt is called a wave.
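A hedged sketch of that retry loop (hypothetical helpers, not the LogDevice appender): each wave picks a copyset and attempts the stores; if too few succeed, the appender tries again with a new copyset.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical types and helpers for illustration only.
struct Node { int id; };

// Pretend to choose `replication` nodes out of a 5-node nodeset
// (duplicates ignored for simplicity).
std::vector<Node> pickCopyset(std::size_t replication) {
  std::vector<Node> copyset;
  for (std::size_t i = 0; i < replication; ++i) {
    copyset.push_back(Node{std::rand() % 5});
  }
  return copyset;
}

// Pretend store that randomly fails.
bool storeOn(const Node&, const char*) { return std::rand() % 4 != 0; }

// One wave = one attempt to store the record on a full copyset.
// If too few stores succeed, pick a new copyset and try again.
bool appendWithWaves(const char* payload, std::size_t replication,
                     int max_waves) {
  for (int wave = 1; wave <= max_waves; ++wave) {
    std::vector<Node> copyset = pickCopyset(replication);
    std::size_t stored = 0;
    for (const Node& n : copyset) {
      if (storeOn(n, payload)) {
        ++stored;
      }
    }
    if (stored == replication) {
      return true;  // fully replicated; the record can be acknowledged
    }
    // Otherwise: start the next wave with a new copyset.
  }
  return false;
}
```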