LDQuery
append_outliers
Provides debugging information for the Sequencer Boycotting feature, which boycotts sequencer nodes whose append success rate is an outlier compared with the rest of the cluster. This table contains the state of the few nodes in the cluster that are responsible for gathering stats from all other nodes, aggregating them, and deciding the list of outliers.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
observed_node_id | int | Node id for which we are measuring stats. |
appends_success | long | Number of appends that this node completed successfully. |
appends_failed | long | Number of appends that this node did not complete successfully. |
msec_since | long | Time in milliseconds since the node has been considered an outlier. |
is_outlier | bool | True if this node is considered an outlier. |
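
As a rough illustration of how the counters in this table relate to the success rate mentioned above, the sketch below computes a per-node append success rate with SQL. It runs against an in-memory SQLite stand-in rather than a real cluster, every row is made up, and it does not reproduce how LogDevice itself decides which nodes are outliers.

```python
import sqlite3

# In-memory stand-in for the append_outliers table; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE append_outliers (node_id INT, observed_node_id INT, "
    "appends_success INT, appends_failed INT, msec_since INT, is_outlier INT)"
)
conn.executemany(
    "INSERT INTO append_outliers VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 1, 990, 10, 0, 0),
        (0, 2, 400, 600, 5000, 1),
        (0, 3, 1000, 0, 0, 0),
    ],
)

# Success rate = successes / (successes + failures), lowest rate first.
# Rows with no appends at all are skipped to avoid dividing by zero.
SUCCESS_RATE_QUERY = """
SELECT observed_node_id,
       1.0 * appends_success / (appends_success + appends_failed) AS success_rate,
       is_outlier
FROM append_outliers
WHERE appends_success + appends_failed > 0
ORDER BY success_rate
"""
rates = conn.execute(SUCCESS_RATE_QUERY).fetchall()
for observed_node_id, success_rate, is_outlier in rates:
    print(observed_node_id, round(success_rate, 3), bool(is_outlier))
```

Only the SELECT statement is meant to carry over to a real query; the table creation and inserts exist solely to make the sketch runnable.
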
append_throughput
For each sequencer node, reports the estimated per-log-group append throughput over various time periods. Because different logs in the same log group may have their sequencer on different nodes in the cluster, it is necessary to aggregate these rates across all nodes in the cluster to get an estimate of the global append throughput of a log group. If Sequencer Batching is enabled, this table reports the rate of appends incoming, ie before batching and compression on the sequencer.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_group_name | string | The name of the log group. |
throughput_1min | long | Throughput average in the past 1 minute. |
throughput_5min | long | Throughput average in the past 5 minutes. |
throughput_10min | long | Throughput average in the past 10 minutes. |
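
The cross-node aggregation described above can be sketched as a SUM grouped by log group. The example uses an in-memory SQLite stand-in for the table with made-up rows and log group names; only the SELECT statement is meant to carry over to a real query.

```python
import sqlite3

# In-memory stand-in for the append_throughput table; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE append_throughput (node_id INT, log_group_name TEXT, "
    "throughput_1min INT, throughput_5min INT, throughput_10min INT)"
)
conn.executemany(
    "INSERT INTO append_throughput VALUES (?, ?, ?, ?, ?)",
    [
        (0, "/example/a", 100, 90, 80),
        (1, "/example/a", 50, 60, 70),
        (1, "/example/b", 10, 10, 10),
    ],
)

# Sum the per-node estimates to approximate each log group's global rate.
GLOBAL_THROUGHPUT_QUERY = """
SELECT log_group_name,
       SUM(throughput_1min),
       SUM(throughput_5min),
       SUM(throughput_10min)
FROM append_throughput
GROUP BY log_group_name
ORDER BY log_group_name
"""
totals = conn.execute(GLOBAL_THROUGHPUT_QUERY).fetchall()
for row in totals:
    print(row)
```
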
catchup_queues
CatchupQueue is a state machine that manages all the read streams of one client on one socket. It contains a queue of read streams for which there are new records to be sent to the client (we say these read streams are not caught up). Read streams from that queue are processed (or "woken-up") in a round-robin fashion. The state machine is implemented in logdevice/common/CatchupQueue.h.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
client | string | Id of the client. |
queued_total | long | Number of read streams queued in that CatchupQueue (ie read streams that are not caught up). |
queued_immediate | long | Number of read streams that are not queued with a delay, see "queued_delayed". |
queued_delayed | long | Number of read streams that are queued with a delay. When these read streams are queued, CatchupQueue waits for a configured amount of time before dequeuing the stream for processing. This happens if the log is configured with the "delivery_latency" option, which enables better batching of reads when tailing. See "deliveryLatency" in logdevice/include/LogAttributes.h. |
record_bytes_queued | long | CatchupQueue also does accounting of how many bytes are enqueued in the socket's output evbuffer. CatchupQueue wakes up several read streams until the buffer reaches the limit set by the option --output-max-records-kb (see logdevice/common/Settings.h). |
storage_task_in_flight | int | Each read stream is processed one by one. When a read stream is processed, it will first try to read some records from the worker thread if there are some records that can be read from RocksDB's block cache. When all records that could be read from the worker thread were read, and if there are more records that can be read, the read stream will issue a storage task to read such records in a slow storage thread. This flag indicates whether or not there is such a storage task currently in flight. |
ping_timer_active | int | Ping timer is a timer that is used to ensure we eventually try to schedule more reads under certain conditions. This column indicates whether the timer is currently active. |
chunk_rebuildings
In-flight ChunkRebuilding state machines, each responsible for re-replicating a short range of records for the same log with consecutive LSNs and the same copyset (see ChunkRebuilding.h). See also: shard_rebuildings, log_rebuildings.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID of the records. |
shard | int | Index of the shard to which the records belong. |
min_lsn | lsn | LSN of first record in the chunk. |
max_lsn | lsn | LSN of last record in the chunk. |
chunk_id | long | ID of the chunk, unique within a process. |
block_id | long | Sticky copyset block to which the records belong. Block can be split into multiple chunks. |
total_bytes | long | Sum of records' payload+header sizes. |
oldest_timestamp | time | Timestamp of the first record in the chunk. |
stores_in_flight | long | Number of records for which we're in the process of storing new copies. |
amends_in_flight | long | Number of records for which we're in the process of amending copysets of existing copies, excluding our own copy. |
amend_self_in_flight | long | Number of records for which we're in the process of amending copysets of our own copy. |
started | time | Time when the ChunkRebuilding was started. |
client_read_streams
ClientReadStream is the state machine responsible for reading records of a log on the client side. The state machine connects to all storage nodes that may contain data for a log and request them to send new records as they are appended to the log. For each ClientReadStream there is one ServerReadStream per storage node the ClientReadStream is talking to. The "readers" table lists all existing ServerReadStreams. Because LDQuery does not fetch any debugging information from clients connected to the cluster, the only ClientReadStreams that will be shown in this table are for internal read streams on the server.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Id of the log being read. |
id | long | Internal identifier for the read stream. |
next_lsn_to_deliver | long | Next LSN that needs to be delivered to the client. |
window_high | lsn | Current window. This is used for flow control. ClientReadStream instructs storage nodes to not send records with LSNs higher than this value. ClientReadStream slides the window as it is able to make progress (see the --client-read-flow-control-threshold setting). |
until_lsn | lsn | LSN up to which the ClientReadStream must read. |
read_set_size | long | Number of storage nodes this ClientReadStream is reading from. Usually equal to the size of the log's nodeset but may be smaller if some nodes from that nodeset are rebuilding. |
gap_end_outside_window | lsn | Sometimes there are no more records to deliver that have a LSN smaller than "window_high". When a storage node reaches the end of the window, it sends a GAP message to inform of the next LSN it will be able to ship past the window. This value is the smallest LSN greater than "window_high" reported by storage nodes and is used for determining the right endpoint of a gap interval in such a situation. |
trim_point | lsn | When a storage node reaches a log's trim point, it informs ClientReadStream through a gap message. This is ClientReadStream's current view of the log's trim point. |
gap_nodes_next_lsn | string | Contains the list of nodes for which we know they don't have a record with LSN "next_lsn_to_deliver". Alongside the node id is the LSN of the next record or gap this storage node is expected to send us. |
unavailable_nodes | string | List of nodes that ClientReadStream knows are unavailable and thus is not trying to read from. |
connection_health | string | Summary of authoritative status of the read session. An AUTHORITATIVE session has at least an f-majority of nodes participating. Reported dataloss indicates all copies of a record were lost, or, much less likely, the only copies of the data are on the R-1 nodes that are currently unavailable, and the cluster failed to detect or remediate the failure that caused some copies to be lost. A NON_AUTHORITATIVE session has less than an f-majority of nodes participating, but those not participating have had detected failures, are not expected to participate, and are being rebuilt. Most readers will stall in this case. Readers that proceed can see dataloss gaps for records that are merely currently unavailable, but will become readable once failed nodes are repaired. An UNHEALTHY session has too many storage nodes down but not marked as failed, to be able to read even non-authoritatively. |
redelivery_inprog | bool | True if a retry of delivery to the application is outstanding. |
filter_version | long | Read session version. This is bumped every time parameters (start, until, SCD, etc.) are changed by the client. |
cluster_state
Fetches the state of the gossip-based failure detector from the nodes of the cluster. When the status column is OK, the dead_nodes, unhealthy_nodes and overloaded_nodes columns list the nodes that the node in question sees as dead, unhealthy and overloaded, respectively. When the status is anything but OK, the request failed for this node, and it may be dead itself.
Column | Type | Description |
---|---|---|
node_id | int | Id of the node. |
status | string | Status of the node. |
dead_nodes | string | List of node IDs that this node believes to be dead. |
boycotted_nodes | string | List of boycotted nodes. |
unhealthy_nodes | string | List of node IDs that this node believes to be unhealthy. |
overloaded_nodes | string | List of node IDs that this node believes to be overloaded. |
epoch_store
EpochStore is the data store that contains epoch-related metadata for all the logs provisioned on a cluster. This table allows querying the metadata in epoch-store for a set of logs.
Column | Type | Description |
---|---|---|
log_id | log_id | Id of the log. |
status | string | "OK" if the query to the epoch store succeeded for that log id. If the log could not be found (which only happens if the user provided query constraints on the "log_id" column), set to NOTFOUND. If we failed to contact the epoch store, set to one of NOTCONN, ACCESS, SYSLIMIT, FAILED. |
since | long | Epoch since which the metadata ("replication", "storage_set", "flags") are in effect. |
epoch | long | Next epoch to be assigned to a sequencer. |
replication | string | Current replication property of the log. |
storage_set_size | long | Number of shards in storage_set. |
storage_set | string | Set of shards that may have data records for the log in epochs ["since", "epoch" - 1]. |
flags | string | Internal flags. See "logdevice/common/EpochMetaData.h" for the description of each flag. |
nodeset_signature | long | Hash of the parts of config that potentially affect the nodeset. |
target_nodeset_size | long | Storage set size that was requested from NodeSetSelector. Can be different from storage_set_size for various reasons, see EpochMetaData.h |
nodeset_seed | long | Random seed used when selecting nodeset. |
lce | long | Last epoch considered clean for this log. Under normal conditions, this is equal to "epoch" - 2. If this value is smaller, this means that the current sequencer needs to run the Log Recovery procedure on epochs ["lce" + 1, "epoch" - 2] and readers will be unable to read data in these epochs until they are cleaned. |
meta_lce | long | Same as "lce" but for the metadata log of this data log. |
written_by | string | Id of the last node in the cluster that updated the epoch store for that log. |
tail_record | string | Human readable string that describes tail record |
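
The relationship between "lce" and "epoch" described above can be turned into a query that lists logs with epochs still awaiting recovery. This is a sketch against an in-memory SQLite stand-in that keeps only the columns the query needs; the rows are made up.

```python
import sqlite3

# In-memory stand-in for a subset of the epoch_store columns; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE epoch_store (log_id INT, status TEXT, epoch INT, lce INT)"
)
conn.executemany(
    "INSERT INTO epoch_store VALUES (?, ?, ?, ?)",
    [
        (1, "OK", 10, 8),            # lce == epoch - 2: nothing to recover
        (2, "OK", 10, 5),            # epochs 6..8 have not been cleaned yet
        (3, "NOTFOUND", None, None), # lookup failed, no metadata to inspect
    ],
)

# Logs whose last clean epoch lags behind "epoch" - 2, together with the
# range ["lce" + 1, "epoch" - 2] that still needs recovery.
UNCLEAN_EPOCHS_QUERY = """
SELECT log_id,
       lce + 1   AS first_unclean_epoch,
       epoch - 2 AS last_unclean_epoch
FROM epoch_store
WHERE status = 'OK' AND lce < epoch - 2
ORDER BY log_id
"""
pending = conn.execute(UNCLEAN_EPOCHS_QUERY).fetchall()
for log_id, first_epoch, last_epoch in pending:
    print(f"log {log_id}: epochs {first_epoch}..{last_epoch} need recovery")
```
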
event_log
Dump debug information about the EventLogStateMachine objects running on nodes in the cluster. The event log is the Replicated State Machine that coordinates rebuilding and contains the authoritative status of all shards in the cluster. This table can be used to debug whether all nodes in the cluster are caught up to the same state.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
delta_log_id | log_id | Id of the delta log. |
snapshot_log_id | log_id | Id of the snapshot log. |
version | lsn | Version of the state. |
delta_read_ptr | lsn | LSN of the last record or gap read from the delta log. |
delta_replay_tail | lsn | On startup, the state machine reads the delta log up to that lsn (inclusive) before delivering the initial state to subscribers. |
snapshot_read_ptr | lsn | Read pointer in the snapshot log. |
snapshot_replay_tail | lsn | On startup, the state machine reads the snapshot log up to that lsn before delivering the initial state to subscribers. |
stalled_waiting_for_snapshot | lsn | If not null, this means the state machine is stalled because it missed data in the delta log either because it saw a DATALOSS or TRIM gap. The state machine will be stalled until it sees a snapshot with a version greater than this LSN. Unless another node writes a snapshot with a bigger version, the operator may have to manually write a snapshot to recover the state machine. |
delta_appends_in_flight | long | How many deltas are currently being appended to the delta log by this node. |
deltas_pending_confirmation | long | How many deltas are currently pending confirmation on this node, ie these are deltas currently being written with the CONFIRM_APPLIED flag, and the node is waiting for the RSM to sync up to that delta's version to confirm whether or not it was applied. |
snapshot_in_flight | long | Whether a snapshot is being appended by this node. Only one node in the cluster is responsible for creating snapshots (typically the node with the smallest node id that's alive according to the failure detector). |
delta_log_bytes | long | Number of bytes of delta records that are past the last snapshot. |
delta_log_records | long | Number of delta records that are past the last snapshot. |
delta_log_healthy | bool | Whether the ClientReadStream state machine used to read the delta log reports itself as healthy, ie it has enough healthy connections to the storage nodes in the delta log's storage set such that it should be able to not miss any delta. |
propagated_read_ptr | lsn | All updates up to this LSN (exclusive) were fully propagated to all state machines (in particular, to RebuildingCoordinator). |
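
One way to check whether all nodes are caught up to the same state is to compare each node's "version" with the highest version seen in the cluster. The sketch below uses an in-memory SQLite stand-in with a subset of the columns and made-up rows. As a simplifying assumption, LSNs are stored as plain integers here; how the lsn type is rendered in real query output is not covered.

```python
import sqlite3

# In-memory stand-in for a subset of the event_log columns; rows are made up
# and LSNs are represented as plain integers for simplicity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (node_id INT, version INT, delta_read_ptr INT)"
)
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?)",
    [
        (0, 1200, 1200),
        (1, 1200, 1200),
        (2, 1150, 1150),  # this node has not yet applied the latest deltas
    ],
)

# Nodes whose state version is behind the highest version in the cluster.
LAGGING_NODES_QUERY = """
SELECT node_id, version
FROM event_log
WHERE version < (SELECT MAX(version) FROM event_log)
ORDER BY node_id
"""
lagging = conn.execute(LAGGING_NODES_QUERY).fetchall()
print(lagging if lagging else "all nodes report the same version")
```

An empty result means every queried node reports the same version.
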
graylist
Provides information on graylisted storage nodes per worker per node. This works only for the outlier based graylisting.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
worker_id | int | The id of the worker running on the node. |
graylisted_node_index | int | The graylisted node ID |
historical_metadata
This table contains information about historical epoch metadata for all logs. While the "epoch_store" table provides information about the current epoch metadata of all logs, this table provides a history of that metadata for epoch ranges since the epoch of the first record that is not trimmed.
Column | Type | Description |
---|---|---|
log_id | log_id | Id of the log |
status | string | "OK" if the query to retrieve historical metadata succeeded for that log id. If the log is not in the config (which only happens if the user provided query constraints on the "log_id" column), set to INVALID_PARAM. If we failed to read the log's historical metadata, set to one of TIMEDOUT, ACCESS, FAILED. |
since | long | Epoch since which the metadata ("replication", "storage_set", "flags") are in effect. |
epoch | long | Epoch up to which the metadata is in effect. |
replication | string | Replication property for records in epochs ["since", "epoch"]. |
storage_set_size | long | Number of shards in storage_set. |
storage_set | string | Set of shards that may have data records for the log in epochs ["since", "epoch"]. |
flags | string | Internal flags. See "logdevice/common/EpochMetaData.h" for the description of each flag. |
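
Because each row covers the inclusive epoch range ["since", "epoch"], the metadata in effect for a given epoch can be found with a range predicate. The sketch below uses an in-memory SQLite stand-in with a subset of the columns; the rows, the placeholder replication strings and the helper name metadata_for_epoch are all made up for illustration.

```python
import sqlite3

# In-memory stand-in for a subset of the historical_metadata columns.
# Rows and the replication strings are placeholders, not real output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE historical_metadata (log_id INT, status TEXT, since INT, "
    "epoch INT, replication TEXT, storage_set_size INT)"
)
conn.executemany(
    "INSERT INTO historical_metadata VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "OK", 1, 4, "rep-A", 3),
        (1, "OK", 5, 9, "rep-B", 5),
    ],
)


def metadata_for_epoch(conn, log_id, epoch):
    """Return the row whose inclusive range [since, epoch] covers `epoch`."""
    return conn.execute(
        "SELECT since, epoch, replication, storage_set_size "
        "FROM historical_metadata "
        "WHERE log_id = ? AND status = 'OK' AND since <= ? AND epoch >= ?",
        (log_id, epoch, epoch),
    ).fetchone()


print(metadata_for_epoch(conn, 1, 7))
print(metadata_for_epoch(conn, 1, 10))  # None: no row covers epoch 10
```
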
historical_metadata_legacy
Same as "historical_metadata", but retrieves the metadata less efficiently by reading the metadata logs directly instead of contacting the sequencer. Provides two additional "lsn" and "timestamp" columns to identify the metadata log record that contains the metadata.
Column | Type | Description |
---|---|---|
log_id | log_id | Id of the log |
status | string | "OK" if the query to retrieve historical metadata succeeded for that log id. If the log is not in the config (which only happens if the user provided query constraints on the "log_id" column), set to INVALID_PARAM. If we failed to read the log's historical metadata, set to one of TIMEDOUT, ACCESS, FAILED. |
since | long | Epoch since which the metadata ("replication", "storage_set", "flags") are in effect. |
epoch | long | Epoch up to which the metadata is in effect. |
replication | string | Replication property for records in epochs ["since", "epoch"]. |
storage_set_size | long | Number of shards in storage_set. |
storage_set | string | Set of shards that may have data records for the log in epochs ["since", "epoch"]. |
flags | string | Internal flags. See "logdevice/common/EpochMetaData.h" for the description of each flag. |
lsn | lsn | LSN of the metadata log record that contains this metadata |
timestamp | long | Timestamp of the metadata log record that contains this metadata |
info
A general information table about the nodes in the cluster, like server start time, package version etc.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
pid | long | Process ID of logdeviced. |
version | string | A string that holds the version and revision, built by who and when. |
package | string | Package name and hash. |
build_user | string | Unixname of user who built this package |
build_time | string | Date and Time of the build. |
start_time | string | Date and Time of when the daemon was started on that node. |
server_id | string | Server Generated ID. |
shards_missing_data | string | A list of the shards that are empty and waiting to be rebuilt. |
min_proto | long | Minimum protocol version supported. |
max_proto | long | Maximum protocol version supported. |
is_auth_enabled | bool | Whether authentication is enabled for this node. |
auth_type | string | Authentication Type. Can be null or one of "self_identification" (insecure authentication method where the server trusts the client), "ssl" (the client can provide a TLS certificate by using the "ssl-load-client-cert" and "ssl-cert-path" settings). See "AuthenticationType" in "logdevice/common/SecurityInformation.h" for more information. |
is_unauthenticated_allowed | bool | Is anonymous access allowed to this server (set only if authentication is enabled, ie "auth_type" is not null). |
is_permission_checking_enabled | bool | Do we check permissions? |
permission_checker_type | string | Permission checker type. Can be null or one of "config" (this method stores the permission data in the config file), "permission_store" (this method stores the ACLs in the config file while the permissions and users are stored in an external store). See "PermissionCheckerType" in "logdevice/common/SecurityInformation.h" for more information. |
rocksdb_version | string | Version of RocksDB. |
info_config
A table that dumps information about all the configurations loaded by each node in the cluster. For each node, there will be one row for the node's configuration which is in the main config.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
uri | string | URI of the config. |
source | int | ID of the node this config originates from. This may not necessarily be the same as "node_id" as nodes can synchronize configuration between each other. |
hash | string | Hash of the config. |
last_modified | time | Date and Time when the config was last modified. |
last_loaded | time | Date and Time when the config was last loaded. |
info_rsm
Show RSM in-memory and durable version information in the cluster
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
peer_id | int | Peer ID |
state | string | State of Peer Node In FailureDetector |
logsconfig_in_memory_version | lsn | logsconfig in-memory version |
logsconfig_durable_version | lsn | logsconfig durable version |
eventlog_in_memory_version | lsn | eventlog in-memory version |
eventlog_durable_version | lsn | eventlog durable version |
iterators
This table allows fetching the list of RocksDB iterators on all storage nodes.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
column_family | string | Name of the column family that the iterator is open on. |
log_id | log_id | ID of the log the iterator is reading. |
is_tailing | long | 1 if it is a tailing iterator, 0 otherwise. |
is_blocking | long | 1 if this is an iterator which is allowed to block when the blocks it reads are not in RocksDB's block cache, 0 otherwise. |
type | string | Type of the iterator. See "IteratorType" in "logdevice/server/locallogstore/IteratorTracker.h" for the list of iterator types. |
rebuilding | long | 1 if this iterator is used for rebuilding, 0 if in other contexts. |
high_level_id | long | A unique identifier assigned to the high-level iterator. Used to tie higher-level iterators with lower-level iterators created by them. |
created_timestamp | time | Timestamp when the iterator was created. |
more_context | string | More information on where this iterator was created. |
last_seek_lsn | lsn | Last LSN this iterator was seeked to. |
last_seek_timestamp | time | When the iterator was last seeked. |
version | long | RocksDB superversion that this iterator points to. |
log_groups
A table that lists the log groups configured in the cluster. A log group is an interval of log ids that share common configuration properties.
Column | Type | Description |
---|---|---|
name | string | Name of the log group. |
logid_lo | log_id | Defines the lower bound (inclusive) of the range of log ids in this log group. |
logid_hi | log_id | Defines the upper bound (inclusive) of the range of log ids in this log group. |
replication_property | string | Replication property configured for this log group. |
synced_copies | int | Number of copies that storage nodes must acknowledge as synced to disk before the record is acknowledged to the client as fully appended. |
max_writes_in_flight | int | The largest number of records not released for delivery that the sequencer allows to be outstanding. |
backlog_duration_sec | int | Time-based retention of records of logs in that log group. If null or zero, this log group does not use time-based retention. |
storage_set_size | int | Size of the storage set for logs in that log group. The storage set is the set of shards that may hold data for a log. |
delivery_latency | int | For logs in that log group, maximum amount of time that we can delay delivery of newly written records. This option increases delivery latency but improves server and client performance. |
scd_enabled | int | Indicates whether the Single Copy Delivery optimization is enabled for this log group. This efficiency optimization allows only one copy of each record to be served to readers. |
custom_fields | string | Custom text field provided by the user. |
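
Since "logid_lo" and "logid_hi" are both inclusive, the log group that owns a given log id can be looked up with a simple range predicate. The sketch below uses an in-memory SQLite stand-in with only the three columns involved; the rows, group names and the helper name log_group_for are made up for illustration.

```python
import sqlite3

# In-memory stand-in for a subset of the log_groups columns; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_groups (name TEXT, logid_lo INT, logid_hi INT)")
conn.executemany(
    "INSERT INTO log_groups VALUES (?, ?, ?)",
    [
        ("/example/a", 1, 100),
        ("/example/b", 101, 200),
    ],
)


def log_group_for(conn, log_id):
    """Return the name of the log group whose inclusive range holds log_id."""
    row = conn.execute(
        "SELECT name FROM log_groups WHERE logid_lo <= ? AND ? <= logid_hi",
        (log_id, log_id),
    ).fetchone()
    return row[0] if row else None


print(log_group_for(conn, 100))
print(log_group_for(conn, 101))
print(log_group_for(conn, 201))  # None: no log group covers this id
```
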
log_rebuildings
This table dumps some per-log state of rebuilding on this donor node, mostly related to reading. See also shard_rebuildings.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | ID of the log. |
shard | int | Index of the shard from which the ShardRebuilding state machine is reading. |
until_lsn | lsn | LSN up to which the log must be rebuilt. See "logdevice/server/rebuilding/RebuildingPlanner.h" for how this LSN is computed. |
rebuilt_up_to | lsn | Next LSN to be considered by this state machine for rebuilding. |
num_replicated | long | Number of records replicated by this state machine so far. |
bytes_replicated | long | Number of bytes replicated by this state machine so far. |
log_storage_state
Tracks all in-memory metadata for logs on storage nodes (see "info log_storage_state" admin command).
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log id for which the storage state is. |
shard | int | Shard for which that log storage state is. |
last_released | lsn | Last Released LSN as seen by the shard. If this does not match the sequencer's last_released_lsn (see the "sequencer" table), this means the shard has not completed purging. |
last_released_src | string | Where the last_released value was gotten from. Either "sequencer" if the sequencer sent a release message, or "local log store" if the value was persisted on disk in the storage shard. |
trim_point | lsn | Trim point for that log on this storage node. |
per_epoch_metadata_trim_point | lsn | Trim point of per-epoch metadata. PerEpochLogMetadata whose epoch is <= this value should be trimmed. |
seal | long | Normal seal. The storage node will reject all stores with sequence numbers belonging to epochs that are <= this value. |
sealed_by | string | Sequencer node that set the normal seal. |
soft_seal | long | Similar to normal seal except that the sequencer did not explicitly seal the storage node; the seal is implicit because a STORE message was sent by a sequencer for a new epoch. |
soft_sealed_by | string | Sequencer node that set the soft seal. |
last_recovery_time | long | Latest time (number of microseconds since steady_clock's epoch) when some storage node tried to recover the state. Not to be confused with Log recovery. |
log_removal_time | long | See LogStorageState::log_removal_time_. |
lce | long | Last clean epoch. Updated when the sequencer notifies this storage node that it has performed recovery on an epoch. |
latest_epoch | long | Latest seen epoch from the sequencer. |
latest_epoch_offset | string | Offsets within the latest epoch |
permanent_errors | long | Set to true if a permanent error such as an IO error has been encountered. When this is the case, expect readers to not be able to read this log on this storage shard. |
logsconfig_rsm
Dump debug information about the LogsConfigStateMachine objects running on nodes in the cluster. The config log is the replicated state machine that stores the logs configuration of a cluster.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
delta_log_id | log_id | Id of the delta log. |
snapshot_log_id | log_id | Id of the snapshot log. |
version | lsn | Version of the state. |
delta_read_ptr | lsn | LSN of the last record or gap read from the delta log. |
delta_replay_tail | lsn | On startup, the state machine reads the delta log up to that lsn (inclusive) before delivering the initial state to subscribers. |
snapshot_read_ptr | lsn | Read pointer in the snapshot log. |
snapshot_replay_tail | lsn | On startup, the state machine reads the snapshot log up to that lsn before delivering the initial state to subscribers. |
stalled_waiting_for_snapshot | lsn | If not null, this means the state machine is stalled because it missed data in the delta log either because it saw a DATALOSS or TRIM gap. The state machine will be stalled until it sees a snapshot with a version greater than this LSN. Unless another node writes a snapshot with a bigger version, the operator may have to manually write a snapshot to recover the state machine. |
delta_appends_in_flight | long | How many deltas are currently being appended to the delta log by this node. |
deltas_pending_confirmation | long | How many deltas are currently pending confirmation on this node, ie these are deltas currently being written with the CONFIRM_APPLIED flag, and the node is waiting for the RSM to sync up to that delta's version to confirm whether or not it was applied. |
snapshot_in_flight | long | Whether a snapshot is being appended by this node. Only one node in the cluster is responsible for creating snapshots (typically the node with the smallest node id that's alive according to the failure detector). |
delta_log_bytes | long | Number of bytes of delta records that are past the last snapshot. |
delta_log_records | long | Number of delta records that are past the last snapshot. |
propagated_read_ptr | long | All updates up to this LSN (exclusive) were fully propagated to all state machines. |
logsdb_directory
Contains debugging information about the LogsDB directory on storage shards.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | int | ID of the shard. |
log_id | long | ID of the log. |
partition | long | ID of the partition. |
first_lsn | lsn | Lower bound of the LSN range for that log in this partition. |
max_lsn | lsn | Upper bound of the LSN range for that log in this partition. |
flags | string | Flags for this partition. "UNDER_REPLICATED" means that some writes for this partition were lost (for instance due to the server crashing) and these records have not yet been rebuilt. |
approximate_size_bytes | long | Approximate data size in this partition for the given log. |
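
Each row of this table describes one log in one partition, so per-log totals on a shard require aggregating across partitions. The sketch below does this against an in-memory SQLite stand-in with a subset of the columns and made-up rows. Matching "UNDER_REPLICATED" as a substring of "flags" is an assumption about how the flags string is formatted.

```python
import sqlite3

# In-memory stand-in for a subset of the logsdb_directory columns; rows are
# made up. "partition" is quoted because it is also an SQL keyword.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE logsdb_directory (node_id INT, shard INT, log_id INT, '
    '"partition" INT, flags TEXT, approximate_size_bytes INT)'
)
conn.executemany(
    "INSERT INTO logsdb_directory VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 0, 1, 10, "", 1000),
        (0, 0, 1, 11, "UNDER_REPLICATED", 500),
        (0, 1, 2, 10, "", 200),
    ],
)

# Per (node, shard, log): partition count, total approximate size, and the
# number of partitions flagged as under-replicated.
PER_LOG_USAGE_QUERY = """
SELECT node_id,
       shard,
       log_id,
       COUNT(*)                               AS num_partitions,
       SUM(approximate_size_bytes)            AS total_bytes,
       SUM(flags LIKE '%UNDER_REPLICATED%')   AS under_replicated_partitions
FROM logsdb_directory
GROUP BY node_id, shard, log_id
ORDER BY node_id, shard, log_id
"""
usage = conn.execute(PER_LOG_USAGE_QUERY).fetchall()
for row in usage:
    print(row)
```
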
logsdb_metadata
List of auxiliary RocksDB column families used by LogsDB (partitioned local log store). "metadata" column family contains partition directory and various logdevice metadata, per-log and otherwise. "unpartitioned" column family contains records of metadata logs and event log.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | long | Index of the local log store shard that this column family belongs to. |
column_family | string | Name of the column family. |
approx_size | long | Estimated size of the column family in bytes. |
l0_files | long | Number of sst (immutable table) files belonging to this column family. |
immutable_memtables | long | Number of inactive memtables that are still kept in memory. See 'partitions' table for details. |
memtable_flush_pending | long | Number of memtables that are in the process of being flushed to disk. |
active_memtable_size | long | Size in bytes of the active memtable. |
all_memtables_size | long | Size in bytes of all memtables. See 'partitions' table for details. |
est_num_keys | long | Estimated number of keys stored in the column family. The estimate tends to be poor because of merge operator. |
est_mem_by_readers | long | Estimated memory used by rocksdb iterators in this column family, excluding block cache. See 'partitions' table for details. |
live_versions | long | Number of live "versions" of this column family in RocksDB. See 'partitions' table for details. |
nodes
Lists the nodes in the cluster from the configuration.
Column | Type | Description |
---|---|---|
node_id | int | Id of the node |
name | string | Human readable name of the node |
address | string | IP and port that should be used for communication with the node |
ssl_address | string | Same as "address" but with SSL |
admin_address | string | The IP address, including port number, for admin server |
generation | long | Generation of the node. This value is bumped each time the node is swapped, sent to repair, or has one of its drives sent to repair. |
location | string | Location of the node: |
sequencer | int | 1 if this node is provisioned for the sequencing role. Otherwise 0. Provisioned roles must be enabled in order to be considered active. |
storage | int | 1 if this node is provisioned for the storage role. Otherwise 0. Provisioned roles must be enabled in order to be considered active. See 'storage_state'. |
sequencer_weight | real | A non-negative value indicating how many logs this node should be a sequencer for relative to other nodes in the cluster. A value of 0 means this node cannot run sequencers. |
storage_state | string | Determines the current state of the storage node. One of "read-write", "read-only" or "none". |
storage_weight | real | A positive value indicating how much STORE traffic this storage node should receive relative to other storage nodes in the cluster. |
num_shards | long | Number of storage shards on this node. 0 if this node is not a storage node. |
is_metadata_node | int | 1 if this node is in the metadata nodeset. Otherwise 0. |
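
As an illustration of how this table might be queried, the sketch below runs a SELECT against an in-memory SQLite table that mimics a few "nodes" columns. The rows and node names are invented, and the sketch assumes LDQuery accepts the same SELECT syntax; against a real cluster only the query text would apply.

```python
import sqlite3

# In-memory stand-in for the "nodes" table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (node_id INT, name TEXT, storage INT, storage_state TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (0, "node-a", 1, "read-write"),
        (1, "node-b", 1, "read-only"),
        (2, "node-c", 0, "none"),
    ],
)

# Nodes provisioned for storage that are not currently in "read-write" state.
QUERY = """
SELECT node_id, name, storage_state
FROM nodes
WHERE storage = 1 AND storage_state != 'read-write'
ORDER BY node_id
"""
not_writable = conn.execute(QUERY).fetchall()
print(not_writable)
```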
partitions
List of LogsDB partitions that store records. Each shard on each node has a sequence of partitions. Each partition corresponds to a time range a few minutes or tens of minutes long. Each partition is a RocksDB column family.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | long | Index of the local log store shard that this partition belongs to. |
id | long | Sequence number of this partition. Partitions in each shard are numbered in chronological order without gaps. |
start_time | time | The beginning of the time range that this partition corresponds to. For most partitions it's equal to the time when partition was created. The exception is partitions created retroactively to accommodate rebuilt data. |
min_time | time | Approximate minimum timestamp of records stored in this partition. |
max_time | time | Approximate maximum timestamp of records stored in this partition. |
min_durable_time | time | Persisted (WAL sync has occurred since the value was written) version of 'min_time'. |
max_durable_time | time | Persisted (WAL sync has occurred since the value was written) version of 'max_time'. |
last_compacted | time | Last time when the partition was compacted. |
approx_size | long | Estimated size of the partition in bytes. |
l0_files | long | Number of sst (immutable table) files belonging to this partition. |
immutable_memtables | long | Number of inactive memtables (in-memory write buffers) that are still kept in memory. The most common cases are memtables in the process of being flushed to disk (memtable_flush_pending) and memtables pinned by iterators. |
memtable_flush_pending | long | Number of memtables (in-memory write buffers) that are in the process of being flushed to disk. |
active_memtable_size | long | Size in bytes of the active memtable. |
all_not_flushed_memtables_size | long | Size in bytes of all memtables that weren't flushed to disk yet. The difference all_memtables_size-all_not_flushed_memtables_size is usually equal to the total size of memtables pinned by iterators. |
all_memtables_size | long | Size in bytes of all memtables. Usually these are: active memtable, memtables that are being flushed and memtables pinned by iterators. |
est_num_keys | long | Estimated number of keys stored in the partition. They're usually records and copyset index entries. The estimate tends to be poor when merge operator is used. |
est_mem_by_readers | long | Estimated memory used by rocksdb iterators in this partition, excluding block cache. This pretty much only includes SST file indexes loaded in memory. |
live_versions | long | Number of live "versions" of this column family in RocksDB. One (current) version is always live. If this value is greater than one, it means that some iterators are pinning some memtables or sst files. |
current_version | long | The current live version |
append_dirtied_by | string | Nodes that have uncommitted append data in this partition. |
rebuild_dirtied_by | string | Nodes that have uncommitted rebuild data in this partition. |
copyset_index_enabled | bool | Whether or not copyset_index is enabled for this partition. This is controlled by --rocksdb-write-copyset-index setting. |
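
The difference all_memtables_size - all_not_flushed_memtables_size described above can be computed directly in a query. The following is a minimal sketch against an in-memory SQLite stand-in for "partitions" with made-up sizes; it assumes LDQuery evaluates arithmetic in the SELECT list the way SQLite does.

```python
import sqlite3

# In-memory stand-in for the "partitions" table; all values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE partitions (node_id INT, shard INT, id INT, "
    "all_memtables_size INT, all_not_flushed_memtables_size INT, "
    "live_versions INT)"
)
conn.executemany(
    "INSERT INTO partitions VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 0, 100, 4096, 4096, 1),
        (0, 0, 101, 8192, 2048, 3),
        (1, 2, 57, 1024, 1024, 1),
    ],
)

# Partitions with more than one live version, with the approximate number of
# memtable bytes that are flushed but still held (usually pinned by iterators).
QUERY = """
SELECT node_id, shard, id,
       all_memtables_size - all_not_flushed_memtables_size AS pinned_bytes,
       live_versions
FROM partitions
WHERE live_versions > 1
ORDER BY pinned_bytes DESC
"""
pinned = conn.execute(QUERY).fetchall()
print(pinned)
```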
purges
List the PurgeUncleanEpochs state machines currently active in the cluster. The responsibility of this state machine is to delete any records that were deleted during log recovery on nodes that did not participate in that recovery. See "logdevice/server/storage/PurgeUncleanEpochs.h" for more information.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID the purge state machine is for. |
state | string | State of the state machine. |
current_last_clean_epoch | long | Last clean epoch considered by the state machine. |
purge_to | long | Epoch up to which this state machine should purge. The state machine will purge epochs in range ["current_last_clean_epoch", "purge_to"]. |
new_last_clean_epoch | long | New "last clean epoch" metadata entry to write into the local log store once purging completes. |
sequencer | string | ID of the sequencer node that initiated purging. |
epoch_state | string | Dump the state of purging for each epoch. See "logdevice/server/storage/PurgeSingleEpoch.h" |
readers
Tracks all ServerReadStreams. A ServerReadStream is a stream of records sent by a storage node to a client running a ClientReadStream (see "client_read_streams" table).
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | int | Shard on which the stream is reading. |
client | string | Name of the client (similar to the "name" column of the "sockets" table). |
log_id | log_id | Id of the log being read. |
start_lsn | lsn | LSN from which the ClientReadStream started reading. |
until_lsn | lsn | LSN up to which ClientReadStream is interested in reading. |
read_pointer | lsn | LSN to the next record to be read by the storage node. |
last_delivered | lsn | Last LSN sent to the ClientReadStream either through a data record or a gap. |
last_record | lsn | LSN of the last record the storage node delivered to the ClientReadStream. This value is not necessarily equal to "last_delivered" as the storage node does not store all records of a log. |
window_high | lsn | Current window used by the ClientReadStream. This instructs the storage node to not send records with LSNs higher than this value. When the read pointer reaches this value, the stream is caught up and the storage node will wake up the stream when the window is slid by ClientReadStream. |
last_released | lsn | Last LSN released for delivery by the sequencer. If this value is stale, check if purging could be the reason using the "purges" table. |
catching_up | int | Indicates whether or not the stream is catching up, ie there are records that can be delivered now to the client, and the stream is enqueued in a CatchupQueue (see the "catchup_queues" table). |
window_end | int | Should be true if "read_pointer" is past "window_high". |
known_down | string | List of storage nodes that the ClientReadStream is not able to receive records from. This list is used so that other storage nodes can send the records that nodes in that list were supposed to send. If this column shows "ALL_SEND_ALL", this means that the ClientReadStream is not running in Single-Copy-Delivery mode, meaning it requires all storage nodes to send every record they have. This can be either because SCD is not enabled for the log, or because data is not correctly replicated and ClientReadStream is not able to reconstitute a contiguous sequence of records from what storage nodes are sending, or because the ClientReadStream is going through an epoch boundary. |
filter_version | long | Each time ClientReadStream rewinds the read streams (and possibly changes the "known_down" list), this counter is updated. Rewinding a read stream means asking the storage node to rewind to a given LSN. |
last_batch_status | string | Status code that was issued when the last batch of records was read for this read stream. See LocalLogStoreReader::read() in logdevice/common/LocalLogStore.h. |
created | time | Timestamp of when this ServerReadStream was created. |
last_enqueue_time | time | Timestamp of when this ServerReadStream was last enqueued for processing. |
last_batch_started_time | time | Timestamp of the last time we started reading a batch of records for this read stream. |
storage_task_in_flight | int | True if there is currently a storage task running on a slow storage thread for reading a batch of records. |
csid | string | Client Session ID |
rsid | string | Read stream ID |
tcp_sndbuf | int | Number of bytes in TCP sndbuf waiting to be sent |
record
This table allows fetching information about individual record copies in the cluster. The user must provide query constraints on the "log_id" and "lsn" columns. This table can be useful to introspect where copies of a record are stored and see their metadata. Do not use it to serve production use cases as this query runs very inefficiently (it bypasses the normal read protocol and instead performs a point query on all storage nodes in the cluster).
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | ID of the log this record is for. |
lsn | lsn | Sequence number of the record. |
shard | int | ID of the shard that holds this record copy. |
wave | int | If "is_written_by_recovery" is 0, contains the wave of that record. |
recovery_epoch | int | If "is_written_by_recovery" is 1, contains the "sequencer epoch" of the log recovery. |
timestamp | string | Timestamp in milliseconds of the record. |
last_known_good | int | Highest ESN in this record's epoch such that at the time this message was originally sent by a sequencer all records with this and lower ESNs in this epoch were known to the sequencer to be fully stored on R nodes. |
copyset | string | Copyset of the record. |
flags | string | Flags for that record. See "logdevice/common/LocalLogStoreRecordFormat.h" to see the list of flags. |
offset_within_epoch | string | Amount of data written to that record within the epoch. |
optional_keys | string | Optional keys provided by the user. See "AppendAttributes" in "logdevice/include/Record.h". |
is_written_by_recovery | bool | Whether this record was replicated by the Log Recovery. |
payload | string | Payload in hex format. |
record_cache
Dumps debugging information about the EpochRecordCache entries in each storage shard in the cluster. EpochRecordCache caches records for a log and epoch that are not yet confirmed as fully stored by the sequencer (ie they are "unclean").
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID for this EpochRecordCache entry. |
shard | int | Shard ID for this EpochRecordCache entry. |
epoch | long | The epoch of this EpochRecordCache entry. |
payload_bytes | long | Total size of payloads of records above LNG held by this EpochRecordCache. |
num_records | long | Number of records above LNG held by this EpochRecordCache. |
consistent | bool | True if the cache is in consistent state and it is safe to consult it as the source of truth. |
disabled | bool | Whether the cache is disabled. |
head_esn | long | ESN of the head of the buffer. |
max_esn | long | Largest ESN ever put in the cache. |
first_lng | long | The first LNG the cache has ever seen since its creation. |
offset_within_epoch | string | Most recent value of the amount of data written in the given epoch as seen by this shard. |
tail_record_lsn | long | LSN of the tail record of this epoch. |
tail_record_ts | long | Timestamp of the tail record of this epoch. |
record_csi
This table allows fetching information about individual record copies in the cluster. The user must provide query constraints on the "log_id" and "lsn" columns. This table can be useful to introspect where copies of a record are stored and see their metadata. Do not use it to serve production use cases as this query runs very inefficiently (it bypasses the normal read protocol and instead performs a point query on all storage nodes in the cluster). This table differs from the "record" table in that it only queries the copyset index and can therefore be more efficient. It can be used to check for divergence between the data and the copyset index.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | ID of the log this record is for. |
lsn | lsn | Sequence number of the record. |
shard | int | ID of the shard that holds this record copy. |
wave | int | If "is_written_by_recovery" is 0, contains the wave of that record. |
recovery_epoch | int | If "is_written_by_recovery" is 1, contains the "sequencer epoch" of the log recovery. |
timestamp | string | Timestamp in milliseconds of the record. |
last_known_good | int | Highest ESN in this record's epoch such that at the time this message was originally sent by a sequencer all records with this and lower ESNs in this epoch were known to the sequencer to be fully stored on R nodes. |
copyset | string | Copyset of the record. |
flags | string | Flags for that record. See "logdevice/common/LocalLogStoreRecordFormat.h" to see the list of flags. |
offset_within_epoch | string | Amount of data written to that record within the epoch. |
optional_keys | string | Optional keys provided by the user. See "AppendAttributes" in "logdevice/include/Record.h". |
is_written_by_recovery | bool | Whether this record was replicated by the Log Recovery. |
payload | string | Payload in hex format. |
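
One way to check for divergence between the data and the copyset index is to run the same point query against "record" and "record_csi" and compare the copysets per (node_id, shard). The sketch below does this against in-memory SQLite stand-ins. The rows, the LSN literal "e5n10" and the copyset string format are all assumptions made for illustration, and copysets() is a hypothetical helper, not part of LDQuery.

```python
import sqlite3

# In-memory stand-ins for "record" and "record_csi"; all rows are invented.
conn = sqlite3.connect(":memory:")
for table in ("record", "record_csi"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(node_id INT, log_id INT, lsn TEXT, shard INT, copyset TEXT)"
    )
conn.executemany(
    "INSERT INTO record VALUES (?, ?, ?, ?, ?)",
    [(0, 1, "e5n10", 0, "[N0:S0, N1:S1]"), (1, 1, "e5n10", 1, "[N0:S0, N1:S1]")],
)
conn.executemany(
    "INSERT INTO record_csi VALUES (?, ?, ?, ?, ?)",
    [(0, 1, "e5n10", 0, "[N0:S0, N1:S1]"), (1, 1, "e5n10", 1, "[N0:S0, N2:S1]")],
)


def copysets(table, log_id, lsn):
    """Map (node_id, shard) -> copyset for one record, read from one table."""
    # The table name only ever comes from the fixed names used below.
    sql = (
        f"SELECT node_id, shard, copyset FROM {table} "
        "WHERE log_id = ? AND lsn = ?"  # both constraints are required
    )
    return {(n, s): c for n, s, c in conn.execute(sql, (log_id, lsn))}


data = copysets("record", 1, "e5n10")
index = copysets("record_csi", 1, "e5n10")

# Copies whose copyset differs, or that appear in only one of the two tables.
diverging = sorted(
    key for key in data.keys() | index.keys() if data.get(key) != index.get(key)
)
print(diverging)
```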
recoveries
Dumps debugging information about currently running Log Recovery procedures. See "logdevice/common/LogRecoveryRequest.h" and "logdevice/common/EpochRecovery.h".
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log being recovered. |
epoch | long | Epoch being recovered. |
state | string | State of the EpochRecovery state machine. Can be one of "FETCHING_LCE" (We are fetching LCE from the epoch store, an EpochRecovery state machine will be created for all epochs between LCE+1 and "epoch"), "READING_METADATA_LOG" (We are reading metadata log in order to retrieve metadata necessary to start the EpochRecovery state machines), "READING_SEQUENCER_METADATA" (We are waiting for the epoch metadata of "epoch" to appear in the metadata log), "ACTIVE" (EpochRecovery is active), "INACTIVE" (EpochRecovery is scheduled for activation). |
lng | long | EpochRecovery's estimate of LNG for this log. |
dig_sz | long | Number of entries in the digest. |
dig_fmajority | long | Whether or not an f-majority of shards in that epoch's storage set have completed the digest phase. |
dig_replic | long | Whether or not the set of shards that have completed the digest meet the replication requirements. |
dig_author | long | Whether the digest is authoritative. If the digest is not authoritative, this means too many shards in the storage set are under-replicated. This is an emergency procedure in which recovery will not plug holes in order to ensure DATALOSS gaps are reported to readers. In this mode, some legitimate holes may be reported as false positive DATALOSS gaps to the reader. |
holes_plugged | long | Number of holes plugged by this EpochRecovery. |
holes_replicate | long | Number of holes re-replicated by this EpochRecovery. |
holes_conflict | long | Number of hole/record conflicts found by this EpochRecovery. |
records_replicate | long | Number of records re-replicated by this EpochRecovery. |
n_mutators | long | Number of active Mutators. A mutator is responsible for replicating a hole or record. See "logdevice/common/Mutator.h". |
recovery_state | string | State of each shard in the recovery set. Possible states: "s" means that the shard still has not sent a SEALED reply, "S" means that the shard has sent a SEALED reply, "d" means that this shard is sending a digest, "D" means that this shard completed the digest, "m" means that this shard has completed the digest and is eligible to participate in the mutation phase, "c" means that the shard has been sent a CLEAN request, "C" means that the shard has successfully processed the CLEANED request. Recovery will stall if too many nodes are in the "s", "d" or "c" phases. Suffix "(UR)" indicates that the shard is under-replicated. Suffix "(AE)" indicates that the shard is empty. |
created | time | Date and Time of when this EpochRecovery was created. |
restarted | time | Date and Time of when this EpochRecovery was last restarted. |
n_restarts | long | Number of times this EpochRecovery was restarted. |
sequencers
This table dumps information about all the Sequencer objects in the cluster. See "logdevice/common/Sequencer.h".
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID this sequencer is for. |
metadata_log_id | string | ID of the corresponding metadata log. |
state | string | State of the sequencer. Can be one of: "UNAVAILABLE" (Sequencer has not yet gotten a valid epoch metadata with an epoch number), "ACTIVATING" (Sequencer is in the process of getting an epoch number and retrieving metadata from the epoch store), "ACTIVE" (Sequencer is able to replicate), "PREEMPTED" (Sequencer has been preempted by another sequencer "preempted_by", appends to this node will be redirected to it), "PERMANENT_ERROR" (Permanent process-wide error such as running out of ephemeral ports). |
epoch | long | Epoch of the sequencer. |
next_lsn | lsn | Next LSN to be issued to a record by this sequencer. |
meta_last_released | lsn | Last released LSN of the metadata log. |
last_released | lsn | Last released LSN for the data log. |
last_known_good | lsn | Last known good LSN for this data log. This is the highest ESN such that all records up to that ESN are known to be fully replicated. |
in_flight | long | Number of appends currently in flight. |
last_used_ms | long | Timestamp of the last record appended by this sequencer. |
state_duration_ms | long | Amount of time in milliseconds the sequencer has been in the current "state". |
nodeset_state | string | Contains debugging information about shards in that epoch's storage set. "H" means that the shard is healthy. "L" means that the shard reached the low watermark for space usage. "O" means that the shard reported being overloaded. "S" means that the shard is out of space. "U" means that the sequencer cannot establish a connection to the shard. "D" means that the shard's local log store is not accepting writes. "G" means that the shard is greylisted for copyset selection because it is too slow. "P" means that the sequencer is currently probing the health of this shard. |
preempted_epoch | long | Epoch of the sequencer that preempted this sequencer (if any). |
preempted_by | long | ID of the sequencer that preempted this sequencer (if any). |
draining | long | Epoch that is draining (if any). Draining means that the sequencer stopped accepting new writes but is completing appends currently in flight. |
metadata_log_written | long | Whether the epoch metadata used by this sequencer has been written to the metadata log. |
trim_point | lsn | The current trim point for this log. |
last_byte_offset | string | Offsets of the tail record. |
bytes_per_second | real | Append throughput averaged over the last throughput_window_seconds seconds. |
throughput_window_seconds | real | Time window over which append throughput estimate bytes_per_second was obtained. |
seconds_until_nodeset_adjustment | real | Time until the next potential nodeset size adjustment or nodeset randomization. Zero if nodeset adjustment is disabled or if the sequencer reactivation is in progress. |
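
The "state" and "state_duration_ms" columns can be combined to spot sequencers that have been out of the "ACTIVE" state for a while. This is a minimal sketch against an in-memory SQLite stand-in with invented rows; it assumes the same SELECT would be accepted by LDQuery.

```python
import sqlite3

# In-memory stand-in for the "sequencers" table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequencers "
    "(node_id INT, log_id INT, state TEXT, epoch INT, state_duration_ms INT)"
)
conn.executemany(
    "INSERT INTO sequencers VALUES (?, ?, ?, ?, ?)",
    [
        (0, 1, "ACTIVE", 5, 120000),
        (1, 2, "PREEMPTED", 7, 45000),
        (2, 3, "ACTIVATING", 2, 90000),
    ],
)

# Sequencers that are not ACTIVE, longest time in the current state first.
QUERY = """
SELECT node_id, log_id, state, state_duration_ms
FROM sequencers
WHERE state != 'ACTIVE'
ORDER BY state_duration_ms DESC
"""
not_active = conn.execute(QUERY).fetchall()
print(not_active)
```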
settings
Dumps the state of all settings for all nodes in the cluster.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
bundle_name | string | Name of the bundle this setting is for. |
name | string | Name of the setting. |
current_value | string | Current value of the setting. |
default_value | string | Default value of the setting. |
from_cli | string | Value provided by the CLI, or null. |
from_config | string | Value provided by the config or null. |
from_admin_cmd | string | Value provided by the "set" admin command or null. |
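
Because this table has one row per node and setting, it can be grouped by setting name to find settings whose current value is not the same on every node. The sketch below shows the idea on an in-memory SQLite stand-in. The setting names "setting-a" and "setting-b" and their values are hypothetical, and the sketch assumes LDQuery supports GROUP BY and HAVING as SQLite does.

```python
import sqlite3

# In-memory stand-in for the "settings" table; names and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settings "
    "(node_id INT, name TEXT, current_value TEXT, default_value TEXT)"
)
conn.executemany(
    "INSERT INTO settings VALUES (?, ?, ?, ?)",
    [
        (0, "setting-a", "10", "10"),
        (1, "setting-a", "25", "10"),
        (0, "setting-b", "true", "true"),
        (1, "setting-b", "true", "true"),
    ],
)

# Settings that have more than one distinct current value across nodes.
QUERY = """
SELECT name, COUNT(DISTINCT current_value) AS n_values
FROM settings
GROUP BY name
HAVING COUNT(DISTINCT current_value) > 1
ORDER BY name
"""
inconsistent = conn.execute(QUERY).fetchall()
print(inconsistent)
```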
shard_authoritative_status
Show the current state in the event log. This contains each shard's authoritative status (see "logdevice/common/AuthoritativeStatus.h"), as well as additional information related to rebuilding.
Column | Type | Description |
---|---|---|
node_id | int | Id of the node. |
shard | long | Id of the shard. |
rebuilding_version | string | Rebuilding version: the LSN of the last SHARD_NEEDS_REBUILD delta from the event log. |
authoritative_status | string | Authoritative status of the shard. |
donors_remaining | string | If authoritative status is UNDERREPLICATION, list of donors that have not finished rebuilding the under-replicated data. |
drain | int | Whether the shard is being drained or has been drained. |
mode | string | Whether rebuilding is in RESTORE or RELOCATE mode |
dirty_ranges | string | Time ranges where this shard may be missing data. This happens if the LogDevice process on this storage node crashed before committing data to disk. |
rebuilding_is_authoritative | int | Whether rebuilding is authoritative. A non authoritative rebuilding means that too many shards lost data such that all copies of some records may be unavailable. Some readers may stall when this happens and there are some shards that are still marked as recoverable. |
data_is_recoverable | int | Indicates whether the shard's data has been marked as unrecoverable using ldshell mark-unrecoverable. If all shards in the rebuilding set are marked unrecoverable, shards for which rebuilding completed will transition to AUTHORITATIVE_EMPTY status even if that rebuilding is non authoritative. Note that if logdeviced is started on a shard whose corresponding disk has been wiped by a remediation, the shard's data will automatically be considered unrecoverable. |
source | string | Entity that triggered rebuilding for this shard. |
rebuilding_started_ts | string | When rebuilding was started. |
rebuilding_completed_ts | string | When the shard transitioned to AUTHORITATIVE_EMPTY. |
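
A typical use of this table is to list the shards that are still under-replicated together with the donors they are waiting on. The following sketch uses an in-memory SQLite stand-in; the rows and the format of the "donors_remaining" string are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for "shard_authoritative_status"; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shard_authoritative_status "
    "(node_id INT, shard INT, authoritative_status TEXT, donors_remaining TEXT)"
)
conn.executemany(
    "INSERT INTO shard_authoritative_status VALUES (?, ?, ?, ?)",
    [
        (3, 0, "UNDERREPLICATION", "N1:S0,N2:S0"),
        (4, 1, "AUTHORITATIVE_EMPTY", ""),
    ],
)

# Shards still being rebuilt, with the donors that have not finished yet.
QUERY = """
SELECT node_id, shard, donors_remaining
FROM shard_authoritative_status
WHERE authoritative_status = 'UNDERREPLICATION'
ORDER BY node_id, shard
"""
under_replicated = conn.execute(QUERY).fetchall()
print(under_replicated)
```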
shard_authoritative_status_spew
Like shard_authoritative_status_verbose but has even more columns.
Column | Type | Description |
---|---|---|
node_id | int | Id of the node. |
shard | long | Id of the shard. |
rebuilding_version | string | Rebuilding version: the LSN of the last SHARD_NEEDS_REBUILD delta from the event log. |
authoritative_status | string | Authoritative status of the shard. |
donors_remaining | string | If authoritative status is UNDERREPLICATION, list of donors that have not finished rebuilding the under-replicated data. |
drain | int | Whether the shard is being drained or has been drained. |
mode | string | Whether rebuilding is in RESTORE or RELOCATE mode |
dirty_ranges | string | Time ranges where this shard may be missing data. This happens if the LogDevice process on this storage node crashed before committing data to disk. |
rebuilding_is_authoritative | int | Whether rebuilding is authoritative. A non authoritative rebuilding means that too many shards lost data such that all copies of some records may be unavailable. Some readers may stall when this happens and there are some shards that are still marked as recoverable. |
data_is_recoverable | int | Indicates whether the shard's data has been marked as unrecoverable using ldshell mark-unrecoverable. If all shards in the rebuilding set are marked unrecoverable, shards for which rebuilding completed will transition to AUTHORITATIVE_EMPTY status even if that rebuilding is non authoritative. Note that if logdeviced is started on a shard whose corresponding disk has been wiped by a remediation, the shard's data will automatically be considered unrecoverable. |
source | string | Entity that triggered rebuilding for this shard. |
rebuilding_started_ts | string | When rebuilding was started. |
rebuilding_completed_ts | string | When the shard transitioned to AUTHORITATIVE_EMPTY. |
acked | int | Whether the node acked the rebuilding. (Why would such nodes remain in the rebuilding set at all? No one remembers now.) |
ack_lsn | string | LSN of the SHARD_ACK_REBUILT written by this shard. |
ack_version | string | Version of the rebuilding that was acked. |
donors_complete | string | |
donors_complete_authoritatively | string | |
shard_authoritative_status_verbose
Like shard_authoritative_status but has more columns and prints all the shards contained in the RSM state, including the noisy ones, e.g. nodes that were removed from config. Can be useful for investigating specifics of the event log RSM behavior, but not for much else.
Column | Type | Description |
---|---|---|
node_id | int | Id of the node. |
shard | long | Id of the shard. |
rebuilding_version | string | Rebuilding version: the LSN of the last SHARD_NEEDS_REBUILD delta from the event log. |
authoritative_status | string | Authoritative status of the shard. |
donors_remaining | string | If authoritative status is UNDERREPLICATION, list of donors that have not finished rebuilding the under-replicated data. |
drain | int | Whether the shard is being drained or has been drained. |
mode | string | Whether rebuilding is in RESTORE or RELOCATE mode |
dirty_ranges | string | Time ranges where this shard may be missing data. This happens if the LogDevice process on this storage node crashed before committing data to disk. |
rebuilding_is_authoritative | int | Whether rebuilding is authoritative. A non authoritative rebuilding means that too many shards lost data such that all copies of some records may be unavailable. Some readers may stall when this happens and there are some shards that are still marked as recoverable. |
data_is_recoverable | int | Indicates whether the shard's data has been marked as unrecoverable using ldshell mark-unrecoverable. If all shards in the rebuilding set are marked unrecoverable, shards for which rebuilding completed will transition to AUTHORITATIVE_EMPTY status even if that rebuilding is non authoritative. Note that if logdeviced is started on a shard whose corresponding disk has been wiped by a remediation, the shard's data will automatically be considered unrecoverable. |
source | string | Entity that triggered rebuilding for this shard. |
rebuilding_started_ts | string | When rebuilding was started. |
rebuilding_completed_ts | string | When the shard transitioned to AUTHORITATIVE_EMPTY. |
acked | int | Whether the node acked the rebuilding. (Why would such nodes remain in the rebuilding set at all? No one remembers now.) |
ack_lsn | string | LSN of the SHARD_ACK_REBUILT written by this shard. |
ack_version | string | Version of the rebuilding that was acked. |
shard_rebuildings
Show debugging information about the ShardRebuilding state machines (see "logdevice/server/rebuilding/ShardRebuilding.h"). This state machine is responsible for coordinating reads and re-replication on a donor shard.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard_id | long | Donor shard. |
rebuilding_set | string | The list of shards that lost record copies which need to be re-replicated elsewhere. Expressed in the form "<shard-id>*?[<dirty-ranges>],...". "*" indicates that the shard may be up but we want to drain its data by replicating it elsewhere. If <dirty-ranges> is not empty, this means that the storage shard only lost data within the specified ranges. |
version | lsn | Rebuilding version. This version comes from the event log RSM that coordinates rebuilding. See the "event_log" table. |
global_window_end | time | End of the global window (if enabled with --rebuilding-global-window). This is a time window used to synchronize all ShardRebuilding state machines across all donor shards. |
progress_timestamp | time | Approximately how far rebuilding has progressed on this donor, timestamp-wise. This may be the min timestamp of records of in-flight RecordRebuilding-s, or partition timestamp that ReadStorageTask has reached, or something else. |
num_logs_waiting_for_plan | long | Number of logs that are waiting for a plan. See "logdevice/include/RebuildingPlanner.h". |
total_memory_used | long | Approximate total amount of memory used by ShardRebuilding state machine. |
num_active_logs | long | Number of logs being rebuilt for this shard. The shard completes rebuilding when this number reaches zero. |
participating | int | True if this shard is a donor for this rebuilding and hasn't finished rebuilding yet. |
time_by_state | string | Time spent in each state. 'stalled' means either waiting for global window or aborted because of a persistent error. |
task_in_flight | int | True if a storage task for reading records is in queue or in flight right now. |
persistent_error | int | True if we encountered an unrecoverable error when reading. Shard shouldn't stay in this state for more than a few seconds: it's expected that RebuildingCoordinator will request a rebuilding for this shard, and rebuilding will rewind without this node's participation. |
read_buffer_bytes | long | Bytes of records that we've read but haven't started re-replicating yet. |
records_in_flight | long | Number of records that are being re-replicated right now. |
read_pointer | string | How far we have read: partition, log ID, LSN. |
progress | real | Approximately what fraction of the work is done, between 0 and 1. -1 if the implementation doesn't support progress estimation. |
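
Since "progress" uses -1 as a sentinel for "not supported", any consumer has to treat negative values separately from real fractions. The sketch below reads the column from an in-memory SQLite stand-in with invented rows and converts it to a percentage, mapping the sentinel to None.

```python
import sqlite3

# In-memory stand-in for the "shard_rebuildings" table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shard_rebuildings (node_id INT, shard_id INT, progress REAL)"
)
conn.executemany(
    "INSERT INTO shard_rebuildings VALUES (?, ?, ?)",
    [(0, 1, 0.25), (1, 1, -1.0), (2, 0, 0.9)],
)

QUERY = """
SELECT node_id, shard_id, progress
FROM shard_rebuildings
ORDER BY node_id, shard_id
"""

# Percentage done per donor shard; None when progress estimation is
# unavailable (the column reports -1 in that case).
percent_done = {
    (node_id, shard_id): None if progress < 0 else round(progress * 100)
    for node_id, shard_id, progress in conn.execute(QUERY)
}
print(percent_done)
```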
shards
Show information about all shards in a cluster.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | long | Shard the information is for. |
is_failing | long | If true, the server could not open the DB for this shard on startup. This can happen if the disk on which this shard resides is broken for instance. |
accepting_writes | string | Status indicating if this shard is accepting writes. Can be one of: "OK" (the shard is accepting writes), "LOW_ON_SPC" (the shard is accepting writes but is low on free space), "NOSPC" (the shard is not accepting writes because it is out of free space), "DISABLED" (The shard will never accept writes. This can happen if the shard entered fail-safe mode). |
rebuilding_state | string | "NONE": the shard is not rebuilding. "WAITING_FOR_REBUILDING": the shard is missing data and is waiting for rebuilding to start. "REBUILDING": the shard is missing data and rebuilding was started. |
default_cf_version | long | Current version of the data. If LogsDB is enabled, this is the version of the default column family. |
dirty_state | string | Status indicating if this shard has dirty ranges or not. |
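
The status columns above lend themselves to a single health query that returns only the shards needing attention. This is a sketch against an in-memory SQLite stand-in with invented rows, assuming LDQuery accepts the same WHERE clause.

```python
import sqlite3

# In-memory stand-in for the "shards" table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shards (node_id INT, shard INT, is_failing INT, "
    "accepting_writes TEXT, rebuilding_state TEXT)"
)
conn.executemany(
    "INSERT INTO shards VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, "OK", "NONE"),
        (0, 1, 0, "LOW_ON_SPC", "NONE"),
        (1, 0, 1, "DISABLED", "WAITING_FOR_REBUILDING"),
    ],
)

# Shards that failed to open, are not fully accepting writes, or are missing
# data and waiting for (or undergoing) rebuilding.
QUERY = """
SELECT node_id, shard, accepting_writes, rebuilding_state
FROM shards
WHERE is_failing != 0
   OR accepting_writes != 'OK'
   OR rebuilding_state != 'NONE'
ORDER BY node_id, shard
"""
unhealthy = conn.execute(QUERY).fetchall()
print(unhealthy)
```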
sockets
Tracks all Connections on all nodes in the cluster.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
state | string | State of the Connection. I: The connection is Inactive; C: The connection is connecting; H: The connection is doing the handshake at the LD protocol level; A: The connection is active. |
name | string | Name of the Connection. If the other end is a client, the format is similar to the column "client" of the table "catchup_queues" and the column "client" of the table "readers". If the other end is another node in the cluster, describes that node's ID and IP. |
pending_kb | real | Number of bytes that are available for writing on the Connection's output buffer. If this value is high this usually means that the other end is not able to read messages as fast as we are writing them. |
available_kb | real | Number of bytes that are available for reading on the Connection's input buffer. If this value is high this usually means that the other end is writing faster than this node is able to read. |
read_mb | real | Number of bytes that were read from the Connection. |
write_mb | real | Number of bytes that were written to the Connection. |
read_cnt | int | Number of messages that were read from the Connection. |
write_cnt | int | Number of messages that were written to the Connection. |
bytes_per_second | real | Connection throughput in the last health check period. |
rwnd_limited_pct | real | Portion of last health check period, when Connection throughput was limited by receiver. |
sndbuf_limited_pct | real | Portion of last health check period, when Connection throughput was limited by send buffer. |
proto | int | Protocol that was handshaken. Do not trust this value if the Connection's state is not active. |
sendbuf | int | Size of the send buffer of the underlying TCP socket. |
is_ssl | int | Set to true if this Connection uses SSL. |
fd | int | The file descriptor of the underlying OS socket. |
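
As an illustration of how the buffer columns can be used, the sketch below looks for active connections whose peer appears slow to read. It runs the SQL against an in-memory SQLite table that stands in for "sockets"; only a few of the documented columns are modelled, and the rows, the connection names and the 1024 threshold are made up for the example.

```python
import sqlite3

# Hypothetical stand-in for the LDQuery "sockets" table: a subset of the
# documented columns, filled with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sockets (node_id INT, state TEXT, name TEXT, "
    "pending_kb REAL, available_kb REAL)"
)
conn.executemany(
    "INSERT INTO sockets VALUES (?, ?, ?, ?, ?)",
    [
        (0, "A", "client-a", 12.5, 0.0),
        (0, "A", "client-b", 4096.0, 1.0),
        (1, "H", "client-c", 9000.0, 0.0),  # still handshaking, filtered out
        (1, "A", "node-2", 2048.0, 3.5),
    ],
)

# Active connections with a large amount of data pending in the output
# buffer. The threshold is an arbitrary illustration, not a recommendation.
SLOW_PEER_QUERY = """
    SELECT node_id, name, pending_kb
    FROM sockets
    WHERE state = 'A' AND pending_kb > 1024
    ORDER BY pending_kb DESC
"""
slow_peers = conn.execute(SLOW_PEER_QUERY).fetchall()
for row in slow_peers:
    print(row)
```

Filtering on state = 'A' keeps connections that are still connecting or handshaking out of the result, since their counters say little about the peer's read speed.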
stats
Return statistics for all nodes in the cluster. See "logdevice/common/stats/". See "stats_rocksdb" for statistics related to RocksDB.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
name | string | Name of the stat counter. |
value | long | Value of the stat counter. |
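
Because this table has one row per node and counter, a cluster-wide figure requires aggregating over node_id. The sketch below shows one way to do that, assuming an in-memory SQLite table as a stand-in for "stats"; the counter names and values are invented and do not correspond to real LogDevice counters.

```python
import sqlite3

# Hypothetical stand-in for the LDQuery "stats" table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (node_id INT, name TEXT, value INT)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?)",
    [
        (0, "example_counter_a", 10),  # counter names are invented
        (1, "example_counter_a", 32),
        (0, "example_counter_b", 7),
        (1, "example_counter_b", 0),
    ],
)

# Sum each counter over all nodes and report the largest per-node value,
# which helps spot a single node that dominates the total.
TOTALS_QUERY = """
    SELECT name, SUM(value) AS total, MAX(value) AS max_on_one_node
    FROM stats
    GROUP BY name
    ORDER BY name
"""
totals = conn.execute(TOTALS_QUERY).fetchall()
for row in totals:
    print(row)
```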
stats_rocksdb
Return RocksDB statistics for all nodes in the cluster.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
name | string | Name of the stat counter. |
value | long | Value of the stat counter. |
storage_tasks
List of storage tasks currently pending on the storage thread queues. Note that this does not include the task that is currently executing, nor the tasks that are queueing on the per-worker storage task queues. Querying this table prevents storage tasks from being popped off the queues while the query is executing, so use it with care.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
shard | long | Index of the local log store shard to query storage tasks on. |
priority | string | Priority of the storage task. The tasks with a higher priority get executed before tasks with a lower priority. |
is_write_queue | bool | True if this is a task from the write_queue, otherwise this is a task from the ordinary storage task queue for the thread. |
sequence_no | long | Sequence number of the task. For tasks that are on the same queue with the same priority, tasks with a lower sequence number execute first. |
thread_type | string | The type of the thread the task is queueing for. |
task_type | string | Type of the task, if specified. |
enqueue_time | time | Time when the storage task has been inserted into the queue. |
durability | string | The durability requirement for this storage task (applies to writes). |
log_id | log_id | Log ID that the storage task will perform writes/reads on. |
lsn | lsn | LSN that the storage task will act on. The specific meaning of this field varies depending on the task type. |
client_id | string | ClientID of the client that initiated the storage task. |
client_address | string | Address of the client that initiated the storage task. |
extra_info | string | Other information specific to particular task type. |
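
A common use of this table is to see where tasks are piling up. The sketch below groups pending tasks by node, shard and priority, assuming an in-memory SQLite table as a stand-in for "storage_tasks"; the rows and the priority labels "HIGH" and "LOW" are placeholders chosen for the example, not values taken from LogDevice.

```python
import sqlite3

# Hypothetical stand-in for the LDQuery "storage_tasks" table: a subset of
# the documented columns, filled with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage_tasks (node_id INT, shard INT, priority TEXT, "
    "sequence_no INT)"
)
conn.executemany(
    "INSERT INTO storage_tasks VALUES (?, ?, ?, ?)",
    [
        (0, 0, "HIGH", 101),  # priority labels are placeholders
        (0, 0, "HIGH", 102),
        (0, 0, "LOW", 57),
        (0, 1, "LOW", 3),
        (1, 0, "HIGH", 9),
    ],
)

# Number of pending tasks per (node, shard, priority), together with the
# lowest sequence number found in each group.
BACKLOG_QUERY = """
    SELECT node_id, shard, priority,
           COUNT(*) AS pending,
           MIN(sequence_no) AS lowest_seq
    FROM storage_tasks
    GROUP BY node_id, shard, priority
    ORDER BY pending DESC, node_id, shard, priority
"""
backlog = conn.execute(BACKLOG_QUERY).fetchall()
for row in backlog:
    print(row)
```

Against a real cluster, a single aggregate query like this is preferable to repeated polling, given that the queues are held while the query runs.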
stored_logs
List of logs that have at least one record currently present in LogsDB, per shard. Doesn't include internal logs (metadata logs, event log, config log). Note that it is possible that all the existing records are behind the trim point but haven't been removed from the DB yet (by dropping or compacting partitions); see also the "rocksdb-partition-compaction-schedule" setting.
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID present in this shard. |
shard | long | Shard that contains this log. |
highest_lsn | lsn | Highest LSN that this shard has ever seen for this log. |
highest_partition | long | ID of the highest LogsDB partition that contains at least one record for this log. You can use the "partitions" table to inspect LogsDB partitions. |
highest_timestamp_approx | time | Approximate value of the highest timestamp of records for this log on this shard. This is an upper bound, as long as timestamps are non-decreasing with LSN in this log. Can be overestimated by up to the value of the "rocksdb-partition-duration" setting. |
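
Since there is one row per log and shard, counting rows gives a rough picture of how logs are spread across shards. The sketch below does this on an in-memory SQLite table standing in for "stored_logs"; the rows are made up, and representing log_id as a plain integer is an assumption of the example.

```python
import sqlite3

# Hypothetical stand-in for the LDQuery "stored_logs" table: a subset of
# the documented columns, with log_id modelled as an integer (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stored_logs (node_id INT, log_id INT, shard INT, "
    "highest_partition INT)"
)
conn.executemany(
    "INSERT INTO stored_logs VALUES (?, ?, ?, ?)",
    [
        (0, 1, 0, 40),
        (0, 2, 0, 42),
        (0, 3, 1, 41),
        (1, 1, 0, 39),
    ],
)

# Number of logs with at least one record per (node, shard), plus the
# highest partition ID that any of those logs reaches on that shard.
PER_SHARD_QUERY = """
    SELECT node_id, shard,
           COUNT(*) AS num_logs,
           MAX(highest_partition) AS newest_partition
    FROM stored_logs
    GROUP BY node_id, shard
    ORDER BY node_id, shard
"""
per_shard = conn.execute(PER_SHARD_QUERY).fetchall()
for row in per_shard:
    print(row)
```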
sync_sequencer_requests
Lists the SyncSequencerRequests currently running on the cluster. See "logdevice/common/SyncSequencerRequest.h".
Column | Type | Description |
---|---|---|
node_id | int | Node ID this row is for. |
log_id | log_id | Log ID the SyncSequencerRequest is for. |
until_lsn | lsn | Next LSN retrieved from the sequencer. |
last_released_lsn | lsn | Last released LSN retrieved from the sequencer. |
last_status | string | Status of the last GetSeqStateRequest performed by SyncSequencerRequest. |