Configuration settings
Admin API/server
Name | Description | Default | Notes |
---|---|---|---|
admin-port | TCP port on which the server listens for admin commands; supports commands over SSL | 6440 | requires restart, server only |
admin-unix-socket | Path to the unix domain socket the server will use to listen for the admin Thrift interface | | requires restart, server only |
disable-maintenance-log-trimming | Disable trimming of the maintenance log | false | server only |
enable-cluster-maintenance-state-machine | Enables the internal replicated state machine that holds the maintenance definitions requested by the rebuilding supervisor or via the admin API. Enabling the state machine will also enable posting internal maintenance requests instead of writing to the event log directly | false | requires restart, server only |
enable-maintenance-manager | Starts the Maintenance Manager. This also automatically enables the maintenance state machine (--enable-cluster-maintenance-state-machine). | false | requires restart, server only |
enable-safety-check-periodic-metadata-update | Make the safety checker update its metadata cache periodically | false | server only |
maintenance-log-max-delta-bytes | How many bytes of deltas to keep in the maintenance log before we snapshot it. | 10485760 | server only |
maintenance-log-max-delta-records | How many delta records to keep in the maintenance log before we snapshot it. | 5000 | server only |
maintenance-log-snapshotting | Allow the maintenance log to be snapshotted onto a snapshot log. This requires the maintenance log group to contain two logs, the first one being the snapshot log and the second one being the delta log. | false | server only |
maintenance-log-snapshotting-period | Controls time-based snapshotting. A new maintenance log snapshot will be created after this period if there are new deltas | 1h | server only |
maintenance-manager-metadata-nodeset-update-period | How often to check whether a metadata nodeset update is required | 2min | server only |
maintenance-manager-reevaluation-timeout | Timeout after which a new run is scheduled in MaintenanceManager. Used for periodic reevaluation of the state in the absence of any state changes | 2min | server only |
max-unavailable-sequencing-capacity-pct | The percentage of the sequencing capacity that is allowed to be taken down by operations; the safety checker takes DEAD nodes into account as well. For example, if this value is 25, the safety checker will deny maintenances that may take down more than 25% of the sequencing capacity. The percentage takes into account the weight of the sequencer nodes, so it does not necessarily equal 25% of the number of sequencer nodes | 25 | server only |
max-unavailable-storage-capacity-pct | The percentage of the storage capacity that is allowed to be taken down by operations; the safety checker takes DEAD nodes into account as well. For example, if this value is 25, the safety checker will deny maintenances that may take down more than 25% of the storage capacity. The percentage takes into account the weight of the shards on the storage nodes, so it does not necessarily equal 25% of the number of storage nodes | 25 | server only |
read-metadata-from-sequencers | Makes the safety checker read the metadata of logs directly from sequencers. | true | server only |
safety-check-failure-sample-size | The number of sample epochs returned by the Maintenance API for each maintenance if safety check blocks the operation. | 10 | server only |
safety-check-max-batch-size | The maximum number of logs to be checked in a single batch. Larger batches are faster, but block the CPU thread pool for longer (not yielding often enough) | 15000 | server only |
safety-check-max-logs-in-flight | The number of concurrent logs that we run checks against during execution of the CheckImpact operation, either internally during a maintenance or through the Admin API's checkImpact() call | 1000 | server only |
safety-check-metadata-update-period | The period between automatic metadata updates for safety checker internal cache | 10min | server only |
safety-margin-low-pri-scopes | This is the extra safety margin that will be accounted for when evaluating maintenances with LOW priority. The value represents how many extra failure domains the system can still lose, in addition to the impact of the requested maintenance, before losing availability | 1 | server only |
safety-margin-medium-pri-scopes | This is the extra safety margin that will be accounted for when evaluating maintenances with MEDIUM priority. The value represents how many extra failure domains the system can still lose, in addition to the impact of the requested maintenance, before losing availability | 0 | server only |
use-force-restore-rebuilding-flag | Allows maintenance manager to set the FORCE_RESTORE rebuilding flag. This is a transitional setting that will be removed once all servers have support for this setting. | false | server only |
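For illustration, here is a minimal sketch of how these maintenance/admin settings might be passed to logdeviced as same-named command-line flags; the config path and values are placeholders, not recommendations:

```sh
# Hypothetical invocation: start the Maintenance Manager (which also
# enables the maintenance state machine), expose the admin API on the
# default port, and allow maintenance log snapshotting.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --enable-maintenance-manager=true \
  --admin-port=6440 \
  --maintenance-log-snapshotting=true
```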
Batching and compression
Name | Description | Default | Notes |
---|---|---|---|
buffered-writer-bg-thread-bytes-threshold | BufferedWriter can send batches to a background thread. For small batches, where the overhead dominates, this will just slow things down. If the total size of the batch is less than this, it will be constructed/compressed on the Worker thread, blocking other appends to all logs in that shard. If larger, it will be enqueued to a helper thread. | 4096 | |
buffered-writer-zstd-level | Zstd compression level to use in BufferedWriter. | 1 | |
sequencer-batching | Accumulate appends from clients and batch them together to create fewer records in the system. This setting is only used when the log group doesn't override it | false | server only |
sequencer-batching-compression | Compression setting for sequencer batching (if used). It can be 'none' for no compression; 'zstd' for ZSTD; 'lz4' for LZ4; or 'lz4_hc' for LZ4 High Compression. The default is ZSTD. When enabled, this gets applied to the first new batch. This setting is only used when the log group doesn't override it | zstd | server only |
sequencer-batching-passthru-threshold | Sequencer batching (if used) will pass through any appends with payload size over this threshold (if positive). This saves us a compression round trip when a large batch comes in from BufferedWriter and the benefit of batching and recompressing would be small. | -1 | server only |
sequencer-batching-size-trigger | Sequencer batching (if used) flushes buffered appends for a log when the total amount of buffered uncompressed data reaches this many bytes (if positive). When enabled, this gets applied to the first new batch. This setting is only used when the log group doesn't override it | -1 | server only |
sequencer-batching-time-trigger | Sequencer batching (if used) flushes buffered appends for a log when the oldest buffered append is this old. When enabled, this gets applied to the first new batch. This setting is only used when the log group doesn't override it | 1s | server only |
socket-batching-time-trigger | Socket batching allows us to batch data before flushing it to the socket to save CPU. It increases memory consumption and introduces additional latency when sending messages. | 0s | |
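As a sketch (values are placeholders), sequencer batching would typically be enabled with flags like the following; per-log-group attributes, where set, override these cluster-wide defaults:

```sh
# Enable sequencer batching with zstd compression; flush a batch once it
# holds 1MB of uncompressed data or its oldest append is 5ms old.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --sequencer-batching=true \
  --sequencer-batching-compression=zstd \
  --sequencer-batching-size-trigger=1048576 \
  --sequencer-batching-time-trigger=5ms
```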
Configuration
Name | Description | Default | Notes |
---|---|---|---|
admin-client-capabilities | If set, the client will have the capabilities for administrative operations such as changing NodesConfiguration. Usually used by emergency tooling. Beware that admin clients use a different NodesConfigurationStore that may not support a large fan-out, so this setting shouldn't be applied to a large number of clients (e.g., through client_settings in settings config). | false | client only |
check-metadata-log-empty-timeout | Timeout for request that verifies that a metadata log does not already exist for a log that is presumed new and whose metadata provisioning has been initiated by a sequencer activation | 300s | server only |
client-default-dscp | Default DSCP value to set on client sockets in the Sender. The valid range is defined by https://tools.ietf.org/html/rfc4594#section-1.4.4 | 0 | requires restart |
config-path | location of the cluster config file to use. Format: [file:]<path-to-config-file> or configerator:<configerator-path> | | CLI only, requires restart, server only |
enable-logsconfig-manager | If true, logdeviced will load the logs configuration from the internal replicated storage and will ignore the logs section in the config file. This also enables the remote management API for logs config. | true | |
enable-node-self-registration | If set, the node will register itself in the config if it doesn't find itself there. Otherwise it will crash. This requires --enable-nodes-configuration-manager=true | false | requires restart, experimental, server only |
enable-nodes-configuration-manager | If set, NodesConfigurationManager and its workflow will be enabled. | true | requires restart |
file-config-update-interval | interval at which to poll the config file for changes (if reading the config from a file on disk) | 10000ms | |
hard-exit-on-node-configuration-mismatch | Exit quickly whenever the server's NodeID and/or service discovery configuration changes. Necessary so that we don't get stuck for the shutdown timeout if flushing IOs on shared storage is slow or failing. This allows another storage head to take over the shared storage as quickly as possible. Ultimately, IO fencing will solve this more gracefully. | false | server only |
initial-config-load-timeout | maximum time to wait for initial server configuration until giving up | 15s | CLI only, requires restart, server only |
logsconfig-manager-grace-period | Grace period before making a change to the logs config available to the server. | 0ms | |
logsconfig-max-delta-bytes | How many bytes of deltas to keep in the logsconfig deltas log before we snapshot it. | 10485760 | server only |
logsconfig-max-delta-records | How many delta records to keep in the logsconfig deltas log before we snapshot it. | 5000 | server only |
logsconfig-snapshotting-period | Controls time based snapshotting. New logsconfig snapshot will be created after this period if there are new log configuration deltas | 1h | server only |
max-sequencer-background-activations-in-flight | Max number of concurrent background sequencer activations to run. Background sequencer activations perform log metadata changes (reprovisioning) when the configuration attributes of a log change. | 20 | server only |
nodes-configuration-init-retry-timeout | timeout settings for the exponential backoff retry behavior for initializing Nodes Configuration for the first time | 500ms..5s | |
nodes-configuration-init-timeout | defines the maximum time allowed on the initial nodes configuration fetch. | 60s | |
nodes-configuration-manager-intermediary-shard-state-timeout | Timeout for proposing the transition for a shard from an intermediary state to its 'destination' state | 180s | |
nodes-configuration-manager-store-polling-interval | Polling interval of NodesConfigurationManager to NodesConfigurationStore to read NodesConfiguration | 3s | |
nodes-configuration-seed-servers | The seed string that will be used to fetch the initial nodes configuration. It can be in the form string: | | client only |
on-demand-logs-config | Set this to true if you want the client to get log configuration on demand from the server when log configuration is not included in the main config file. | false | requires restart, client only |
on-demand-logs-config-retry-delay | When a client's attempt to get log configuration information from server on demand fails, the client waits this much before retrying. | 5ms..1s | client only |
remote-logs-config-cache-ttl | The TTL for cache entries for the remote logs config. If the logs config is not available locally and is fetched from the server, this will determine how fresh the log configuration used by the client will be. | 60s | requires restart, client only |
rsm-snapshot-request-timeout | Overall timeout for GetRsmSnapshotRequest | 30s | |
rsm-snapshot-request-wave-timeout | timeout settings for Fetching RSM snapshot via MessageBased Store | 2s..5s | |
sequencer-background-activation-retry-interval | Retry interval on failures while processing background sequencer activations for reprovisioning. | 500ms | server only |
sequencer-epoch-store-write-retry-delay | The retry delay for sequencer writing log metadata into the epoch store during log reconfiguration. | 5s..1min-2x | server only |
sequencer-historical-metadata-retry-delay | The retry delay for sequencer reading metadata log for historical epoch metadata during log reconfiguration. | 5s..1min-2x | server only |
sequencer-metadata-log-write-retry-delay | The retry delay for sequencer writing into its own metadata log during log reconfiguration. | 500ms..30s-2x | server only |
sequencer-reactivation-delay-secs | Some sequencer reactivations may be postponed when the changes that triggered the reactivation are not important enough to be propagated immediately. E.g., changes to the replication factor or window size need to be made visible immediately; on the other hand, changes to the nodeset due to, say, the 'exclude_from_nodeset' flag being set as part of a passive drain can be postponed. If the reactivation can be postponed, the delay is chosen as a random number of seconds within the above range. If 0, don't postpone | 60s..3600s | server only |
server-based-nodes-configuration-polling-wave-timeout | timeout settings for server based Nodes Configuration Store's multi-wave backoff retry behavior | 500ms..10s | |
server-based-nodes-configuration-store-polling-responses | how many successful responses for server based Nodes Configuration Store polling to wait for in each round | 2 | |
server-based-nodes-configuration-store-timeout | The timeout of the Server Based Nodes Configuration Store's NODES_CONFIGURATION polling round. | 60s | |
server-default-dscp | Use default DSCP to setup to server sockets at Sender.Range was defined by https://tools.ietf.org/html/rfc4594#section-1.4.4 | 0 | requires restart, server only |
server_based_nodes_configuration_store_polling_extra_requests | how many extra requests to send for server based Nodes Configuration Store polling in addition to the required response for each wave | 1 | |
shutdown-on-node-configuration-mismatch | Gracefully shutdown whenever the server's NodeID and/or service discovery configuration changes. | true | server only |
sleep-secs-after-self-registeration | If set, the node will sleep for this many seconds after self-registration | 5s | server only |
tls-ticket-seeds-path | The path to the TLS ticket seed file, only useful when use-tls-ticket-seeds is set. Read TLSCredProcessor::processTLSTickets to understand the format of the seed. | | requires restart, server only |
use-nodes-configuration-manager-nodes-configuration | If true and enable_nodes_configuration_manager is set, logdevice will use the nodes configuration from the NodesConfigurationManager. | true | |
use-tls-ticket-seeds | If enabled, the TLS ticket seeds will be read and used to initialize the TLS ticket encryption keys. Once enabled, if all the servers share the same seeds file, they will all use the same ticket encryption key and can decrypt each other's generated tickets, enabling clients to resume TLS sessions across all the servers in the same cluster. | false | requires restart, server only |
zk-config-polling-interval | polling and retry interval for Zookeeper config source | 1000ms | CLI only |
zk-vsc-max-retries | Number of transient error retries for the zookeeper versioned config store. | 3 | requires restart |
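A minimal sketch of the configuration-related flags above (paths and values are placeholders):

```sh
# Read the cluster config from a local file, poll it for changes every 5s,
# and load the logs configuration from internal replicated storage.
logdeviced \
  --config-path=file:/etc/logdevice/cluster.json \
  --file-config-update-interval=5000ms \
  --enable-logsconfig-manager=true
```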
Core settings
Name | Description | Default | Notes |
---|---|---|---|
admin-enabled | Is Admin API enabled? | true | requires restart, experimental, server only |
append-timeout | Timeout for appends. If omitted the client timeout will be used. | | client only |
enable-hh-wheel-backed-timers | Enables the new version of timers which run on a different thread and use HHWheelTimer backend. | true | requires restart |
enable-store-histograms-calculations | Enables estimation of store timeouts per worker per node. | false | server only |
external-loglevel | One of the following: critical, error, warning, info, debug, none | critical | |
findkey-timeout | Findkey API call timeout. If omitted the client timeout will be used. | | client only |
log-file | write server error log to specified file instead of stderr | | server only |
loglevel | One of the following: critical, error, warning, info, debug, none | info | server only |
logsconfig-timeout | Timeout for LogsConfig API requests. If omitted the client timeout will be used. | | client only |
meta-api-timeout | Timeout for trims/isLogEmpty/tailLSN/datasize API/etc. If omitted the client timeout will be used. | | client only |
my-location | {client-only setting}. Specifies the location of the machine running the client. Used for determining whether to use SSL based on --ssl-boundary. Also used in local SCD reading. Format: "{region}.{dc}.{cluster}.{row}.{rack}". | | requires restart, client only |
port | TCP port on which the server listens for non-SSL clients | 16111 | CLI only, requires restart, server only |
rsm-force-all-send-all | Forces ALL_SEND_ALL mode for read streams associated with RSM. | true | requires restart |
rsm-snapshot-enable-dual-writes | Decides whether snapshots should be written to the log based store as well (to allow rolling back from the local store to the log based store in case of emergency) | true | server only |
rsm-snapshot-store-type | One of the following: legacy (use legacy way of storing and retrieving snapshots from a log), log (use Log Based snapshot store and point queries to fetch snapshots instead of tailing), message (Message Based, for bootstrapping the RSM snapshot from a remote cluster host), local-store (from a snapshot stored in the local store) | log | requires restart |
server-id | optional server ID, reported by INFO admin command | | requires restart, server only |
shutdown-timeout | amount of time to wait for the server to shut down before terminating the process. Consider modifying --time-delay-before-force-abort when changing this value. | 120s | server only |
store-histogram-min-samples-per-bucket | How many stores should the store histogram wait for before reporting latency estimates | 30 | server only |
time-delay-before-force-abort | Time delay before force abort of remaining work is attempted during shutdown. The value is in 50ms time periods. The quiescence condition is checked once every 50ms time period. When the timer expires for the first time, all pending requests are aborted and the timer is restarted. On second expiration all remaining TCP connections are reset (RST packets sent). | 400 | server only |
unmap-caches | unmap RocksDB block cache before dumping core (reduces core file size) | true | server only |
user | user to switch to if server is run as root | | requires restart, server only |
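A sketch combining a few of the core settings above (paths and values are placeholders):

```sh
# Plaintext data port, error log written to a file at info level,
# and a 2-minute shutdown budget before the process is terminated.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --port=16111 \
  --loglevel=info \
  --log-file=/var/log/logdevice/logdeviced.log \
  --shutdown-timeout=120s
```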
Epoch Store
Name | Description | Default | Notes |
---|---|---|---|
epoch-store-double-write-new-serialization-format | If set, the epoch store will double-write any data it modifies: the data goes to its corresponding znode, and the same data serialized with the new serialization format goes to the parent znode | false | server only |
zk-create-root-znodes | If "false", the root znodes for a tier should be pre-created externally before logdevice can do any ZooKeeper epoch store operations | true | server only |
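A sketch of overriding the epoch store behavior above, assuming the root znodes are created by external tooling:

```sh
# Require the root znodes to be pre-created externally instead of letting
# logdevice create them in ZooKeeper.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --zk-create-root-znodes=false
```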
Failure detector
Name | Description | Default | Notes |
---|---|---|---|
cluster-state-refresh-interval | how frequently to search for the sequencer in case of an append timeout | 10s | client only |
enable-initial-get-cluster-state | Enable executing a GetClusterState request to retrieve the state of the cluster as soon as the client is created | true | client only |
failover-blacklist-threshold | How many gossip intervals to ignore a node for after it performed a graceful failover | 100 | server only |
failover-wait-time | How long to wait for the failover request to be propagated to other nodes | 3s | server only |
gcs-wait-duration | How long to wait for get-cluster-state reply to come, to initialize state of cluster nodes. Bringup is sent after this reply comes or after timeout | 1s | server only |
gossip-include-rsm-versions-frequency | How frequently to send RSM and NCM version information in a GOSSIP message. If the value is 10, it means the versions will be present in 1/10th of the GOSSIP_Messages. | 10 | server only |
gossip-interval | How often to send a gossip message. Lower values improve detection time, but make nodes more chatty. | 100ms | requires restart, server only |
gossip-intervals-without-processing-threshold | How many intervals a node is allowed to go through without processing gossip messages. If this is crossed, the node will be marked as DEAD even if it's still sending out gossip messages. | 50 | server only |
gossip-logging-duration | How long to keep logging FailureDetector/Gossip related activity after server comes up. | 0s | server only |
gossip-mode | How to select a node to send a gossip message to. One of: 'round-robin', 'random' (default) | random | requires restart, server only |
gossip-threshold | Specifies after how many gossip intervals of inactivity a node is marked as dead. Lower values reduce detection time, but make false positives more likely. | 30 | server only |
gossip-time-skew | How much delay is acceptable in receiving a gossip message. | 10s | server only |
ignore-isolation | Ignore isolation detection. If set, a sequencer will accept append operations even if a majority of nodes appear to be dead. | false | server only |
min-gossips-for-stable-state | How many gossips the FailureDetector must receive before it considers itself stable and starts doing state transitions of cluster nodes based on incoming gossips. | 3 | server only |
suspect-duration | How long to keep a node in an intermediate state before marking it as available. Larger values make the cluster less prone to node flakiness, but extend the time needed for sequencer nodes to start participating. | 10s | server only |
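To see how the gossip settings interact, here is a sketch of a more aggressive failure-detection configuration (values are illustrative only):

```sh
# Gossip every 50ms and mark a node dead after 20 silent gossip intervals,
# i.e. roughly 20 * 50ms = 1s of silence, at the cost of chattier nodes
# and a higher false-positive risk.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --gossip-interval=50ms \
  --gossip-threshold=20 \
  --suspect-duration=10s
```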
LogsDB
Name | Description | Default | Notes |
---|---|---|---|
manual-compact-interval | minimal interval between consecutive manual compactions on log stores that are out of disk space | 1h | server only |
rocksdb-background-wal-sync | Deprecated and ignored. | true | server only |
rocksdb-directory-consistency-check-period | LogsDB will compare all on-disk directory entries with the in-memory directory no more frequently than once per this period of time. | 5min | server only |
rocksdb-free-disk-space-threshold-low | Keep free disk space above this fraction of disk size by marking node full if we exceed it, and let the sequencer initiate space-based retention. Only counts logdevice data, so storing other data on the disk could cause it to fill up even with space-based retention enabled. 0 means disabled. | 0 | server only |
rocksdb-io-tracing-shards | List of shards for which to enable IO tracing. 'all' to enable for all shards, 'none' or empty string to disable for all shards. IO tracing prints information about every sufficiently slow (see rocksdb-io-tracing-threshold) IO operation (like file read() and write() calls) to the log at info level. | all | server only |
rocksdb-io-tracing-stall-threshold | If this setting is nonzero, and rocksdb-io-tracing-shards is enabled, IO tracing will spin up a background thread to periodically poll the list of active IO operations and report when an operation is stuck for at least this long. The purpose is to detect stuck IO operations, which wouldn't be reported by the regular IO tracing because it only reports an operation after it completes. If set to '0', stall detection will be disabled, and no background thread will be created. | 30s | server only |
rocksdb-io-tracing-threshold | IO tracing (see rocksdb-io-tracing-shards) will report only operations that took at least this long. Set to '0' to report all operations. | 5s | server only |
rocksdb-metadata-compaction-period | Metadata column family will be flushed+compacted at least this often. This is needed to mitigate performance issues in rare cases. Full scenario: suppose all writes to this node stopped; eventually all logs will be fully trimmed, and logsdb directory will be emptied by deleting each key; these deletes will usually be flushed in sst files different than the ones where the original entries are; this makes iterator operations very expensive because merging iterator has to skip all these deleted entries in linear time; this is especially bad for findTime. If we compact every hour, this badness would last for at most an hour. | 1h | server only |
rocksdb-new-partition-timestamp-margin | Newly created partitions will get starting timestamp now + new\_partition\_timestamp\_margin . This absorbs the latency of creating partition and possible small clock skew between sequencer and storage node. If creating partition takes longer than that, or clock skew is greater than that, FindTime may be inaccurate. For reference, as of August 2017, creating a partition typically takes ~200-800ms on HDD with ~1100 existing partitions. | 10s | server only |
rocksdb-num-metadata-locks | number of lock stripes to use to perform LogsDB metadata updates | 256 | requires restart, server only |
rocksdb-partition-compaction-schedule | If set, indicates that the node will run compactions. This is a list of durations indicating at what ages to compact a partition. e.g. "3d, 7d" means that each partition will be compacted twice: when all logs with backlog of up to 3 days are trimmed from it, and when all logs with backlog of up to 7 days are trimmed from it. "auto" (default) means use all backlog durations from config. "disabled" disables partition compactions. | auto | server only |
rocksdb-partition-compactions-enabled | perform background compactions for space reclamation in LogsDB | true | server only |
rocksdb-partition-count-soft-limit | If the number of partitions in a shard reaches this value, some measures will be taken to limit the creation of new partitions: partition age limit is tripled; partition file limit is ignored; partitions are not pre-created on startup; partitions are not prepended for records with small timestamp. This limit is intended mostly as protection against timestamp outliers: e.g. if we receive a STORE with zero timestamp, without this limit we would create over a million partitions to cover the time range from 1970 to now. | 2000 | server only |
rocksdb-partition-duration | create a new partition when the latest one becomes this old; 0 means infinity | 15min | server only |
rocksdb-partition-file-limit | create a new partition when the number of level-0 files in the existing partition exceeds this threshold; 0 means infinity | 200 | server only |
rocksdb-partition-flush-check-period | How often a flusher thread will go over all shards looking for memtables to flush. The flusher thread is responsible for deciding to flush memtables on various triggers like data age, idle time and size. If flushes are managed by logdevice, the flusher thread is responsible for persisting any data on the system. This setting is tuned based on 3 things: memory size on the node, write throughput the node can support, and how fast data can be persisted. A value of 0 disables all manual flushes; this is done in tests to disable all flushes in the system. | 200ms | server only |
rocksdb-partition-hi-pri-check-period | how often a background thread will check if new partition should be created | 2s | server only |
rocksdb-partition-lo-pri-check-period | how often a background thread will trim logs and check if old partitions should be dropped or compacted, and do the drops and compactions | 30s | server only |
rocksdb-partition-partial-compaction-file-num-threshold-old | don't consider file ranges for partial compactions (used during rebuilding) that are shorter than this for old partitions (>1d old). | 100 | server only |
rocksdb-partition-partial-compaction-file-num-threshold-recent | don't consider file ranges for partial compactions (used during rebuilding) that are shorter than this, for recent partitions (<1d old). | 10 | server only |
rocksdb-partition-partial-compaction-file-size-threshold | the largest L0 files that it is beneficial to compact on their own. Note that we can still compact larger files than this if that enables us to compact a longer range of consecutive files. | 50000000 | server only |
rocksdb-partition-partial-compaction-largest-file-share | Partial compaction candidate file ranges that contain a file that comprises a larger proportion of the total file size in the range than this setting, will not be considered. | 0.7 | server only |
rocksdb-partition-partial-compaction-max-file-size | the maximum size of an l0 file to consider for compaction. If not set, defaults to 2x --rocksdb-partition-partial-compaction-file-size-threshold | 0 | server only |
rocksdb-partition-partial-compaction-max-files | the maximum number of files to compact in a single partial compaction | 120 | server only |
rocksdb-partition-partial-compaction-max-num-per-loop | How many partial compactions to do in a row before re-checking if there are higher priority things to do (like dropping partitions). This value is not important; used for tests. | 4 | server only |
rocksdb-partition-partial-compaction-old-age-threshold | A partition is considered 'old' from the perspective of partial compaction if it is older than this threshold; otherwise it is considered a recent partition. Old and recent partitions have different thresholds, partition_partial_compaction_file_num_threshold_old and partition_partial_compaction_file_num_threshold_recent, when being considered for partial compaction. | 6h | server only |
rocksdb-partition-partial-compaction-stall-trigger | Stall rebuilding writes if partial compactions are outstanding in at least this many partitions. 0 means infinity. | 50 | server only |
rocksdb-partition-redirty-grace-period | Minimum guaranteed time period for a node to re-dirty a partition after a MemTable is flushed without incurring a synchronous write penalty to update the partition dirty metadata. | 5s | server only |
rocksdb-partition-size-limit | create a new partition when size of the latest partition exceeds this threshold; 0 means infinity | 6G | server only |
rocksdb-partition-timestamp-granularity | minimum and maximum timestamps of a partition will be updated this often | 5s | server only |
rocksdb-pinned-memtables-limit-percent | Memory budget for flushed memtables pinned by iterators, as percentage of --rocksdb-memtable-size-per-node. More precisely, each shard will reject writes if its total memtable size (active+flushing+pinned) is greater than rocksdb-memtable-size-per-node/num-shards*(1 + rocksdb-pinned-memtables-limit-percent/100). Currently set to a high value because rejecting writes is too disruptive in practice, better to use more memory and hope we won't OOM. TODO (#45309029): make cached iterators not pin memtables, then decrease this setting back to something like 50%. | 200 | server only |
rocksdb-prepended-partition-min-lifetime | Avoid dropping newly prepended partitions for this amount of time. | 300s | server only |
rocksdb-print-details | If true, print information about each flushed memtable and each partial compaction. It's not very spammy, an event every few seconds at most. The same events are also always logged by rocksdb to LOG file, but with fewer logdevice-specific details. | false | server only |
rocksdb-proactive-compaction-enabled | If set, proactively compact all partitions (besides the two latest) that were never compacted. Compaction is done on a low-priority background thread | false | server only |
rocksdb-read-find-time-index | If set to true, the operation findTime will use the findTime index to seek to the LSN instead of doing a binary search in the partition. | false | server only |
rocksdb-read-only | Open LogsDB in read-only mode | false | requires restart, server only |
rocksdb-sbr-force | If true, space based retention will be done on the storage side, irrespective of whether sequencer initiated it or not. This is meant to make a node's storage available in case there is a critical bug. | false | experimental, server only |
rocksdb-test-clamp-backlog | Override backlog duration of all logs to be <= this value. This is a quick hack for testing, don't use in production! The override applies only in a few places, not to everything using the log attributes. E.g. disable-data-log-rebuilding is not aware of this setting and will use full retention from log attributes. | 0 | server only |
rocksdb-track-iterator-versions | Track iterator versions for the "info iterators" admin command | false | server only |
rocksdb-unconfigured-log-trimming-grace-period | A grace period to delay trimming of records that are no longer in the config. The intent is to allow the oncall enough time to restore a backup of the config, in case the log(s) shouldn't have been removed. | 4d | server only |
rocksdb-use-age-size-flush-heuristic | If true, we use age * size of the MemTable to decide if we need to flush it, otherwise we use age for that. | true | experimental, server only |
rocksdb-use-copyset-index | If set to true, the read path will use the copyset index to skip records that do not pass copyset filters. This greatly improves the efficiency of reading and rebuilding if records are large (1KB or bigger). For small records, the overhead of maintaining the copyset index negates the savings. If this setting is enabled while --rocksdb-write-copyset-index is not, it will be ignored. After enabling --rocksdb-write-copyset-index, all NEW partitions will have the copyset index enabled | true | requires restart, server only |
rocksdb-verify-checksum-during-store | If true, verify checksum on every store. Reject store on failure and return E::CHECKSUM_MISMATCH. | true | server only |
rocksdb-wal-buffer-size | WAL buffer size when rocksdb-wal-buffering is set to background-append or delayed-append. Some things to consider: (1) We allocate 2 buffers of this size for each WAL file. (2) There may occasionally be a few such files open at a time, but most of the time just one. (3) Normally, a new WAL file is created on every flush, which typically happens every few hundred MB, but in extreme cases may be happening every couple MB during rebuilding. (4) If append-store-durability is set to 'memory', WAL gets very few writes - only occasional small pieces of metadata. (5) If this buffer size is set too small, writes may stall if a particularly slow background write or sync stalls long enough for the foreground thread to run out of buffer space. (6) If this buffer size is set too big, the allocation/deallocation of the oversized buffers may be a significant overhead when creating WAL files that never grow big enough to use all of the buffer. (The latter limitation is unnecessary: we could make the buffer grow lazily; that is not implemented at the moment.) | 8M | server only |
rocksdb-wal-buffering | What write-ahead log operations to offload from write path to background thread. 'none' - no offloading, 'background-range-sync' - offload optional sync_file_range() calls, 'background-append' - offload writes, 'delayed-append' - offload and batch writes. | background-range-sync | server only |
rocksdb-worker-blocking-io-threshold | Log a message if a blocking file deletion takes at least this long on a Worker thread | 10ms | server only |
sbr-low-watermark-check-interval | Time after which space based trim check can be done on a nodeset | 60s | server only |
sbr-node-threshold | threshold fraction of full nodes which triggers space-based retention, if enabled (sequencer-only option), 0 means disabled | 0 | server only |
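A sketch of typical LogsDB partition tuning using the settings above (values are placeholders, not recommendations):

```sh
# Roll a new partition every 15 minutes or at 6GB, and compact each
# partition once all logs with 3-day, then 7-day, backlogs are trimmed
# from it.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --rocksdb-partition-duration=15min \
  --rocksdb-partition-size-limit=6G \
  --rocksdb-partition-compaction-schedule="3d, 7d"
```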
Monitoring
Name | Description | Default | Notes |
---|---|---|---|
client-readers-flow-max-acceptable-time-lag-per-tag | Map that establishes the maximum acceptable time lag for each monitoring tag. A reader that passes the maximum acceptable time lag will be considered unhealthy for the purpose of increasing weight when pushing samples. See 'client-readers-flow-tracer-unhealthy-publish-weight'. | | client only |
client-readers-flow-tracer-GSS-skip-remote-preemption-checks | If set, skips remote preemption checks (aka CHECK SEALs) on GSSs issued by ClientReadersFlowTracer. | true | client only |
client-readers-flow-tracer-high-pri-max-lag | Max allowed amount of lag for high priority readers. | max | client only |
client-readers-flow-tracer-lagging-metric-num-sample-groups | Maximum number of samples that are kept by ClientReadersFlowTracer for computing relative reading speed in relation to writing speed. See client_readers_flow_tracer_lagging_slope_threshold. | 3 | client only |
client-readers-flow-tracer-lagging-metric-sample-group-size | Number of samples in ClientReadersFlowTracer that are aggregated and recorded as one entry. See client-readers-flow-tracer-lagging-metric-sample-group-size. | 20 | client only |
client-readers-flow-tracer-lagging-slope-threshold | If a reader's lag increases at least at this rate, the reader is considered lagging (the rate is given as the variation of time lag per time unit). If the desired read ratio needs to be x% of the write ratio, set this threshold to (1 - x / 100). | -0.3 | client only |
client-readers-flow-tracer-period | Period for logging in logdevice_readers_flow scuba table and for triggering certain sampling actions for monitoring. Set it to 0 to disable feature. | 0s | client only |
client-readers-flow-tracer-unhealthy-publish-weight | Weight given to traces of unhealthy readers when publishing samples (for improved debuggability). | 5.0 | client only |
disable-trace-logger | If disabled, NoopTraceLogger will be used, otherwise FBTraceLogger is used | false | requires restart |
enable-health-monitor | Toggle use of HealthMonitor to determine node status on server-side. | true | requires restart, server only |
health-monitor-max-delay | Maximum tolerated delay between health monitor loops. | 50ms | requires restart, server only |
health-monitor-max-overloaded-worker-percentage | Maximum tolerable percentage of HM-detected overloaded workers. If the percentage of overloaded workers rises above this value, the whole node transitions into an overloaded state in the HM. Setting this percentage to a value greater than 1.0 ensures that overloaded workers are excluded from HM decision-making. | 0.3 | requires restart, server only |
health-monitor-max-queue-stall-duration | Sum of queue stall durations over a health-monitor-poll-interval-ms period that, if generated by fewer than health-monitor-max-queue-stalls queued requests, results in the worker being counted as overloaded by the health monitor | 200ms | requires restart, server only |
health-monitor-max-queue-stalls-avg | Maximum average of queue stalls in health-monitor-poll-interval-ms that are not counted towards overload even if their sum exceeds health-monitor-max-queue-stall-duration. | 100ms | requires restart, server only |
health-monitor-max-stalled-worker-percentage | Maximum tolerable percentage of HM-detected stalled workers. If the percentage of stalled workers rises above this value, the whole node transitions into an UNHEALTHY state in the HM. Setting this percentage to a value greater than 1.0 ensures that stalled workers are excluded from HM decision-making. | 0.2 | requires restart, server only |
health-monitor-max-stalls-avg | Maximum average of worker stalls in health-monitor-poll-interval-ms that are not counted towards UNHEALTHY. | 90ms | requires restart, server only |
health-monitor-poll-interval | Interval after which health monitor detects issues on node. | 500ms | requires restart, server only |
message-tracing-log-level | For messages that pass the message tracing filters, emit a log line at this level. One of: critical, error, warning, notify, info, debug, spew | info | |
message-tracing-peers | Emit a log line for each sent/received message to/from the specified address(es). Separate different addresses with a comma, prefix unix socket paths with 'unix://'. An empty unix path will match all unix paths | ||
message-tracing-types | Emit a log line for each sent/received message of the type(s) specified. Separate different types with a comma. 'all' to trace all messages. Prefix the value with '~' to trace all types except the given ones, e.g. '~WINDOW,RELEASE' will trace messages of all types except WINDOW and RELEASE. | ||
overload-detector-freshness-factor | Multiple of recv windows that need to be read from socket since last sample to consider socket for percentile analysis in OverloadDetector. See overload-detector-percentile. | 1.0 | client only |
overload-detector-percentile | Percentile of active connections for which we compare occupancy with overload-detector-threshold to do the final assessment of overload. See overload-detector-freshness-factor. | 99 | client only |
overload-detector-period | Sampling period for OverloadDetector | 60s | client only |
overload-detector-threshold | Minimum recv-q occupancy to declare socket overloaded (in percent). See overload-detector-percentile. | 80 | client only |
publish-single-histogram-stats | If true, single histogram values will be published alongside the rate values. | false | |
reader-stalled-grace-period | Amount of time we wait before declaring a reader stalled because we can't read the metadata or data log. When this grace period expires, the client stat "read_streams_stalled" is bumped and a record is sent to scuba | 30s | |
reader-stuck-threshold | Amount of time we wait before we report a read stream that is considered stuck. | 121s | |
request-exec-threshold | Request Execution time beyond which it is considered slow, and 'worker_slow_requests' stat is bumped | 10ms | |
request-queue-warning-time-limit | Maximum amount of time that the request can be delayed without printing the warning log. After this time, the warning log will be printed | 20ms | |
shadow-client-timeout | Timeout to use for shadow clients. See traffic-shadow-enabled. | 30s | client only |
slow-background-task-threshold | Background task execution time beyond which it is considered slow, and we log it | 100ms | |
stats-collection-interval | How often to collect and submit stats upstream. Set to <=0 to disable collection of stats. | 60s | requires restart |
traffic-shadow-enabled | Controls the traffic shadowing feature. Defaults to false to disable shadowing on all clients writing to a cluster. Must be set to true to allow traffic shadowing, which will then be controlled on a per-log basis through parameters in LogsConfig. | false | client only |
watchdog-abort-on-stall | Should we abort logdeviced if watchdog detected stalled workers. | false | |
watchdog-bt-ratelimit | Maximum allowed rate of printing backtraces. | 10/120s | requires restart |
watchdog-poll-interval | Interval after which watchdog detects stuck workers | 5000ms | requires restart |
watchdog-print-bt-on-stall | Should we print backtrace of stalled workers. | true | |
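A sketch of monitoring-related overrides (values are placeholders):

```sh
# Collect stats every 30s, flag requests slower than 20ms, and have the
# watchdog print backtraces of stalled workers without aborting.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --stats-collection-interval=30s \
  --request-exec-threshold=20ms \
  --watchdog-poll-interval=5000ms \
  --watchdog-print-bt-on-stall=true \
  --watchdog-abort-on-stall=false
```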
Network communication
Name | Description | Default | Notes |
---|---|---|---|
checksumming-blacklisted-messages | Used to control which messages shouldn't be checksummed at the protocol layer | | requires restart, experimental |
checksumming-enabled | A switch to turn on/off checksumming for all LogDevice protocol messages. If false, no checksumming is done; if true, checksumming-blacklisted-messages is consulted. | false | experimental |
client-default-network-priority | Sets the default client network priority. Clients will connect to the server port associated with this priority, unless 'enable-port-based-qos' is false. Value must be one of 'low', 'medium', or 'high'. | | |
command-conn-limit | Maximum number of concurrent admin connections | 32 | server only |
connect-throttle | timeout after which two nodes retry to connect when they lose a connection. Used in ConnectThrottle to ensure we don't retry too often. Needs restart to load the new values. | 1ms..10s | requires restart |
connect-timeout | connection timeout when establishing a TCP connection to a node | 100ms | |
connect-timeout-retry-multiplier | Multiplier that is applied to the connect timeout after every failed connection attempt | 3 | |
connection-backlog | (server-only setting) Maximum number of incoming connections that have been accepted by listener (have an open FD) but have not been processed by workers (made logdevice protocol handshake). | 2000 | server only |
connection-retries | the number of TCP connection retries before giving up | 4 | |
enable-dscp-reflection | If enabled, the server will use the DSCP Type of Service used by the client for communication. If disabled, the server will use the default DSCP value | true | requires restart, server only |
enable-port-based-qos | Feature gate setting for allowing port-based QoS / connections per network priority. If disabled, all addresses will resolve to the default_data_address listed in nodes configuration. Note that this feature does not apply to connections between servers, only client to server. | false | |
handshake-timeout | LogDevice protocol handshake timeout | 1s | |
idle-connection-keep-alive | How long inactive client-to-server connection will stay open before being shut down automatically. | 5min | client only |
include-destination-on-handshake | Include the destination node ID in the LogDevice protocol handshake. If the actual node ID of the connection target does not match the intended destination ID, the connection is terminated. | true | |
incoming-messages-max-bytes-limit | maximum byte limit of unprocessed messages within the system. | 524288000 | requires restart |
inline-message-execution | Indicates whether messages should be processed right after deserialization. Usually, within the new worker model, all messages are processed after being posted into the work context. This option works only when the worker context runs with the previous eventloop architecture. | false | requires restart |
max-protocol | maximum version of LogDevice protocol that the server/client will accept | 103 | |
max-time-to-allow-socket-drain | If a socket does not drain a complete message for max-time-to-allow-socket-drain, the socket is closed. | 3min | |
min-bytes-to-drain-per-second | Refer to socket-health-check-period for details. | 1000000 | |
min-socket-idle-threshold-percent | A socket is considered active if it had bytes pending in the socket above socket-idle-threshold for greater than min-socket-idle-threshold-percent of socket-health-check-period. | 50 | |
nagle | enable Nagle's algorithm on TCP sockets. Changing this setting on-the-fly will not apply it to existing sockets, only to newly created ones | false | |
outbuf-kb | max output buffer size (userspace extension of socket sendbuf) in KB. Changing this setting on-the-fly will not apply it to existing sockets, only to newly created ones | 32768 | |
outbuf-socket-min-kb | Minimum outstanding bytes per socket in kb. Global sender's budget will be ignored if socket's outstanding bytes is less than this value. Changing this setting on-the-fly will not apply it to existing sockets, only to newly created ones | 1 | server only |
outbufs-limit-per-peer-type | If enabled, the outbytes-mb limit is split in half between client-to-server and server-to-server connections. If disabled, the limit is shared; in particular, a few misbehaving clients may cause the server to use up all its outbytes-mb and become unable to send to other servers. | true | server only |
outbytes-mb | per-thread limit on bytes pending in output evbuffers (in MB) | 512 | |
rate-limit-idle-connection-closed | Max number of idle connections closed in a single round of socket health check. Set to 0 to disable closing of idle connections completely. | 10 | client only |
rate-limit-socket-closed | Max number of sockets closed in a socket health check period. | 1 | |
rcvbuf-kb | TCP socket rcvbuf size in KB. Changing this setting on-the-fly will not apply it to existing sockets, only to newly created ones | -1 | |
read-messages | read up to this many incoming messages before returning to libevent | 128 | |
sendbuf-kb | TCP socket sendbuf size in KB. Changing this setting on-the-fly will not apply it to existing sockets, only to newly created ones | -1 | |
socket-health-check-period | Time between consecutive socket health checks. Every socket-health-check-period, a socket is closed if it was not draining for max-time-to-allow-socket-drain, or if it was active but its throughput while active dropped below min-bytes-to-drain-per-second due to network congestion. | 1min | |
socket-idle-threshold | A socket is considered idle if the number of bytes pending in the socket is below or equal to this threshold. This is used along with min-socket-idle-threshold-percent to find active sockets and select them for health check. Check socket-health-check-period for more details. | 1000000 | |
tcp-keep-alive-intvl | TCP keepalive interval. The interval between successive probes. If negative the OS default will be used. | -1 | |
tcp-keep-alive-probes | TCP keepalive probes. How many unacknowledged probes before the connection is considered broken. If negative the OS default will be used. | -1 | |
tcp-keep-alive-time | TCP keepalive time. This is the time, in seconds, before the first probe will be sent. If negative the OS default will be used. | -1 | |
tcp-user-timeout | The time in milliseconds that transmitted data may remain unacknowledged before TCP will close the connection. 0 for system default. -1 to disable. default is 5min = 300000 | 300000 | |
use-dedicated-server-to-server-address | Temporary switch to roll out dedicated server-to-server address to running clusters with minor disruption. This setting will be removed soon in a future release as soon as the rollout is completed. | false | server only |
use-tcp-keep-alive | Enable TCP keepalive for all connections | true |
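A sketch of TCP-level tuning with the settings above (values are illustrative):

```sh
# Keepalive: first probe after 60s of idleness, then every 10s, give up
# after 5 unacknowledged probes; connect/handshake timeouts spelled out
# explicitly at their defaults for readability.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --use-tcp-keep-alive=true \
  --tcp-keep-alive-time=60 \
  --tcp-keep-alive-intvl=10 \
  --tcp-keep-alive-probes=5 \
  --connect-timeout=100ms \
  --handshake-timeout=1s
```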
Node Registration
Name | Description | Default | Notes |
---|---|---|---|
address | [Only used when node self registration is enabled] The interface address that the server will be listening on for data, gossip, command and admin connections (unless overridden by unix sockets settings). | | CLI only, requires restart, server only |
client-thrift-api-port | [Only used when node self registration is enabled] TCP port the server will use to listen for incoming Thrift API requests from clients. A value of zero disables client Thrift API. | 0 | requires restart, server only |
client-thrift-api-unix-socket | [Only used when node self registration is enabled] Path to the Unix domain socket the server will use to listen for incoming Thrift API requests from clients. | | CLI only, requires restart, server only |
gossip-port | [Only used when node self registration is enabled] TCP port on which the server listens for gossip connections. A value of zero means that the server will listen for socket on the data port. | 0 | requires restart, server only |
gossip-unix-socket | [Only used when node self registration is enabled] Path to the unix domain socket the server will use to listen for gossip connections | | CLI only, requires restart, server only |
location | [Only used when node self registration is enabled] The location of the node. Check the documentation of NodeLocation::fromDomainString to understand the format. | | CLI only, requires restart, server only |
name | The name that the server will use to self register in the nodes configuration. This is the main identifier that the node uses to join the cluster. At any point of time, all the names in the config are unique. If a node joins with a name that's used by another running node, the new node will preempt the old one. | | CLI only, requires restart, server only |
node-version | [Only used when node self registration is enabled] The version provides better control over node self-registration logic. A node will only be allowed to update its attributes on joining the cluster if its version is greater than or equal to the current one. When omitted, it will default to the current version if there is another node running with the same name, and to 0 otherwise. | | CLI only, requires restart, server only |
num-shards | [Only used when node self registration is enabled] defines how many storage shards this node will have. Sharding can be useful to distribute the IO load on multiple disks that are managed by the same daemon. | 1 | requires restart, server only |
ports-per-network-priority | [Only used when node self registration is enabled] TCP ports for each network priority. Format is: " | | requires restart, server only |
roles | [Only used when node self registration is enabled] Defines whether the node is sequencer node, storage node or both. The roles are comma-separated. e.g. 'sequencer,storage' | sequencer,storage | CLI only, requires restart, server only |
sequencer-weight | [Only used when node self registration is enabled] define a proportional value for the number of sequencers to be placed on the machine | 1 | requires restart, server only |
server-thrift-api-port | [Only used when node self registration is enabled] TCP port the server will use to listen for incoming Thrift API requests from servers. A value of zero disables server Thrift API. | 0 | requires restart, server only |
server-thrift-api-unix-socket | [Only used when node self registration is enabled] Path to the Unix domain socket the server will use to listen for incoming Thrift API requests from other servers. | | CLI only, requires restart, server only |
server-to-server-port | [Only used when node self registration is enabled] TCP port on which the server listens for server-to-server connections. A value of zero means that the server listens for other servers on the base address. | 0 | requires restart, server only |
server-to-server-unix-socket | [Only used when node self registration is enabled] Path to the unix domain socket the server will use to listen for other servers. | | CLI only, requires restart, server only |
ssl-port | [Only used when node self registration is enabled] TCP port on which the server listens for SSL clients. A value of zero means that the server won't listen for SSL connections. | 0 | requires restart, server only |
ssl-unix-socket | [Only used when node self registration is enabled] Path to the unix domain socket the server will use to listen for SSL clients | | CLI only, requires restart, server only |
storage-capacity | [Only used when node self registration is enabled] defines a proportional value for the amount of data to be stored compared to other machines. When e.g. total disk size is used as weight for machines with variable disk sizes, the storage will be used proportionally. | 1 | requires restart, server only |
tags | [Only used when node self registration is enabled] Tags to be associated with this node. Tags are specified as a list of key-value pairs, separated by commas. Keys must not contain colons or commas and values can contain anything but commas. Values can be empty, but keys must not. Key-value pairs are specified as "key:value". Example: key_1:value_1,key_2:,key_3:value_3. | | requires restart, server only |
unix-addresses-per-network-priority | [Only used when node self registration is enabled] Unix socket addresses for each network priority. Format is: " | | requires restart, server only |
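Since these settings only apply when node self registration is enabled (see --enable-node-self-registration in the Configuration section), a self-registering node would be started with flags along these lines; the name, address and location are placeholders:

```sh
# Hypothetical self-registering node with 4 storage shards; the location
# string follows the "{region}.{dc}.{cluster}.{row}.{rack}" format.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --enable-node-self-registration=true \
  --name=ld-node-7 \
  --address=10.0.0.7 \
  --roles=sequencer,storage \
  --location=region0.dc0.cl0.row0.rack7 \
  --num-shards=4 \
  --sequencer-weight=1 \
  --storage-capacity=1
```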
Performance
Name | Description | Default | Notes |
---|---|---|---|
allow-reads-on-workers | If false, all rocksdb reads are done from storage threads. If true, a cache-only reading attempt is made from worker thread first, and a storage thread task is scheduled only if the cache wasn't enough to fulfill the read. Disabling this can be used for: working around rocksdb bugs; working around latency spikes caused by cache-only reads being slow sometimes | true | experimental, server only |
disable-check-seals | if true, 'get sequencer state' requests will not be sending 'check seal' requests that they normally do in order to confirm that this sequencer is the most recent one for the log. This saves network and CPU, but may cause getSequencerState() calls to return stale results. Intended for use in production emergencies only. | false | server only |
findtime-force-approximate | (server-only setting) Override the client-supplied FindKeyAccuracy with FindKeyAccuracy::APPROXIMATE. This makes the resource requirements of FindKey requests small and predictable, at the expense of accuracy | false | server only |
write-find-time-index | Set this to true if you want findTime index to be written. A findTime index speeds up findTime() requests by maintaining an index from timestamps to LSNs in LogsDB data partitions. | false | server only |
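A sketch of enabling the findTime index described above; the read-side counterpart lives in the LogsDB section:

```sh
# Write the findTime index and let findTime() use it to seek by timestamp
# instead of binary-searching each partition.
logdeviced \
  --config-path=/etc/logdevice/cluster.json \
  --write-find-time-index=true \
  --rocksdb-read-find-time-index=true
```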
Read path
Name | Description | Default | Notes |
---|---|---|---|
all-read-streams-debug-config-path | The config path for sampling all client read streams debug info | client only | |
all-read-streams-sampling-rate | Rate of sampling all client read streams debug info | 100ms | client only |
authoritative-status-overrides | Force the given authoritative statuses for the given shards. Comma-separated list of overrides, each override of form 'N | server only | |
client-epoch-metadata-cache-size | maximum number of entries in the client-side epoch metadata cache. Set it to 0 to disable the epoch metadata cache. | 50000 | requires restart, client only |
client-initial-redelivery-delay | Initial delay to use when reader application rejects a record or gap | 1s | |
client-max-redelivery-delay | Maximum delay to use when reader application rejects a record or gap | 30s | |
client-read-buffer-size | number of records to buffer per read stream in the client object while reading. If this setting is changed on-the-fly, the change will only apply to new reader instances | 512 | |
client-read-flow-control-threshold | threshold (relative to buffer size) at which the client broadcasts window update messages (less means more often) | 0.7 | |
data-log-gap-grace-period | When non-zero, replaces gap-grace-period for data logs. | 0ms | |
enable-all-read-streams-debug | Enable all read streams sampling of debug info for debugging readers. | false | client only |
enable-read-throttling | Throttle Disk I/O due to log read streams | false | server only |
gap-grace-period | gap detection grace period for all logs, including data logs, metadata logs, and internal state machine logs. Millisecond granularity. Can be 0. | 100ms | |
grace-counter-limit | Maximum number of consecutive grace periods a storage node may fail to send a record or gap (if in all read all mode) before it is considered disgraced and client read streams no longer wait for it. If all nodes are disgraced or in GAP state, a gap record is issued. May be 0. Set to -1 to disable grace counters and use simpler logic: no disgraced nodes, issue gap record as soon as grace period expires. | 2 | |
log-state-recovery-interval | interval between consecutive attempts by a storage node to obtain the attributes of a log residing on that storage node. Such 'log state recovery' is performed independently for each log upon the first request to start delivery of records of that log. The attributes to be recovered include the LSN of the last cumulatively released record in the log, which may have to be requested from the log's sequencer over the network. | 500ms | requires restart, server only |
max-record-bytes-read-at-once | amount of RECORD data to read from local log store at once | 1048576 | server only |
metadata-log-gap-grace-period | When non-zero, replaces gap-grace-period for metadata logs. | 0ms | |
output-max-records-kb | amount of RECORD data to push to the client at once | 1024 | |
reader-reconnect-delay | When a reader client loses a connection to a storage node, delay after which it tries reconnecting. | 10ms..30s | client only |
reader-retry-window-delay | When a reader client fails to send a WINDOW message, delay after which it retries sending it. | 10ms..30s | client only |
reader-started-timeout | How long a reader client waits for a STARTED reply from a storage node before sending a new START message. | 30s..5min | client only |
real-time-eviction-threshold-bytes | When the real time buffer reaches this size, we evict entries. | 80000000 | requires restart, experimental, server only |
real-time-max-bytes | Max size (in bytes) of released records that we'll keep around to use for real time reads. Includes some cache overhead, so for small records, you'll store less record data than this. | 100000000 | requires restart, experimental, server only |
real-time-reads-enabled | Turns on the experimental real time reads feature. | false | experimental, server only |
rsm-scd-copyset-reordering | SCDCopysetReordering values that clients ask servers to use. Currently available options: none, hash-shuffle (default), hash-shuffle-client-seed. hash-shuffle results in only one storage node reading a record block from disk, and then serving it to multiple readers from the cache. hash-shuffle-client-seed enables multiple storage nodes to participate in reading the log, which can benefit non-disk-bound workloads. | hash-shuffle | requires restart |
scd-copyset-reordering-max | SCDCopysetReordering values that clients may ask servers to use. Currently available options: none, hash-shuffle (default), hash-shuffle-client-seed. hash-shuffle results in only one storage node reading a record block from disk, and then serving it to multiple readers from the cache. hash-shuffle-client-seed enables multiple storage nodes to participate in reading the log, which can benefit non-disk-bound workloads. | hash-shuffle | |
unreleased-record-detector-interval | Time interval at which to check for unreleased records in storage nodes. Any log which has unreleased records, and for which no records have been released for two consecutive unreleased-record-detector-intervals, is suspected of having a dead sequencer. Set to 0 to disable check. | 30s | server only |
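
As a rough illustration of the flow-control settings above, the sketch below (a hypothetical helper, not LogDevice code) computes how many records a reader consumes between WINDOW updates given client-read-buffer-size and client-read-flow-control-threshold; per the description, a lower threshold means more frequent updates.

```python
def records_between_window_updates(buffer_size: int = 512,
                                   flow_control_threshold: float = 0.7) -> int:
    """Approximate number of records consumed before the client broadcasts a WINDOW update."""
    return int(buffer_size * flow_control_threshold)

# With the defaults (512 records, threshold 0.7) a WINDOW message goes out roughly
# every ~358 records; lowering the threshold sends smaller, more frequent updates.
print(records_between_window_updates())           # 358
print(records_between_window_updates(512, 0.5))   # 256
```
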
Reader failover
Name | Description | Default | Notes |
---|---|---|---|
reader-slow-shards-detection | If true, readers in SCD mode will detect shards that are very slow and may ask the other storage shards to filter them out | disabled | client only |
reader-slow-shards-detection-moving-avg-duration | When slow shards detection is enabled, duration to use for the moving average | 30min | client only |
reader-slow-shards-detection-outlier-duration | When slow shards detection is enabled, amount of time that we'll consider a shard an outlier if it is slow. | 1min..30min | client only |
reader-slow-shards-detection-outlier-duration-decrease-rate | When slow shards detection is enabled, rate at which we decrease the time after which we'll try to reinstate an outlier in the read set. If the value is 0.25, for each second of healthy reading we will decrease that time by 0.25s. | 0.25 | client only |
reader-slow-shards-detection-required-margin | When slow shards detection is enabled, sensitivity of the outlier detection algorithm. For instance, if set to 3.0, only consider an outlier a shard that is 300% slower than the others. The required margin is adaptive and may increase or decrease but will be capped at a minimum defined by this setting. | 10.0 | client only |
reader-slow-shards-detection-required-margin-decrease-rate | Rate at which we decrease the required margin when we are healthy. If the value is 0.25 for instance, we will reduce the required margin by 0.25 for every second spent reading. | 0.25 | client only |
scd-all-send-all-timeout | Timeout after which ClientReadStream fails over to asking all storage nodes to send everything they have if it is not able to make progress for some time | 600s | |
scd-timeout | Timeout after which ClientReadStream considers a storage node down if it does not send any data for some time but the socket to it remains open. | 300s | |
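
The slow-shards-detection rows above describe an adaptive required margin: at the default of 10.0 a shard must be 1000% slower than its peers to be flagged, and each healthy second of reading lowers the margin by reader-slow-shards-detection-required-margin-decrease-rate, never below the configured floor. A minimal sketch of that decay, assuming the margin is tracked per reader and only healthy time decreases it (illustrative, not LogDevice code):

```python
def decay_required_margin(current_margin: float,
                          healthy_seconds: float,
                          decrease_rate: float = 0.25,
                          margin_floor: float = 10.0) -> float:
    """Lower the adaptive outlier margin after healthy reading, never dropping below the
    floor set by reader-slow-shards-detection-required-margin."""
    return max(margin_floor, current_margin - decrease_rate * healthy_seconds)

# If the margin had adapted up to 20.0, two minutes of healthy reading brings it
# back down to the configured floor of 10.0.
print(decay_required_margin(20.0, healthy_seconds=120))  # 10.0
```
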
Rebuilding
Name | Description | Default | Notes |
---|---|---|---|
allow-conditional-rebuilding-restarts | Used to gate the feature described in T22614431. We want to enable it only after all clients have picked up the corresponding change in RSM protocol. | false | experimental, server only |
auto-mark-unrecoverable-timeout | If too many storage nodes or shards are declared down by the failure detector or RocksDB, readers stall. If this setting is 'max' (default), readers remain stalled until some shards come back, or until the shards are marked _unrecoverable_ (permanently lost) by writing a special record in the event log via an admin tool. If this setting contains a time value, upon the timer expiration the shards are marked unrecoverable automatically. This allows reader clients to skip over all records that could only be delivered by now unrecoverable shards, and continue reading more recent records. | max | server only |
disable-data-log-rebuilding | If set then data logs are not rebuilt. This may be enabled for clusters with very low retention, where the probability of data-loss due to a 2nd or 3rd failure is low and the work done during rebuild interferes with the primary workload. | false | requires restart, server only |
event-log-grace-period | grace period before considering event log caught up | 10s | server only |
event-log-max-delta-bytes | How many bytes of deltas to keep in the event log before we snapshot it. | 10485760 | server only |
event-log-max-delta-records | How many delta records to keep in the event log before we snapshot it. | 5000 | server only |
event-log-retention | How long to keep a history of snapshots and deltas for the event log. Unused if the event log has never been snapshotted or if event log trimming is disabled with disable-event-log-trimming. | 14d | server only |
event-log-snapshot-compression | Use ZSTD compression to compress event log snapshots | true | |
event-log-snapshotting | Allow the event log to be snapshotted onto a snapshot log. This requires the event log group to contain two logs, the first one being the snapshot log and the second one being the delta log. | true | requires restart |
eventlog-snapshotting-period | Controls time based snapshotting. New eventlog snapshot will be created after this period if there are new deltas | 1h | server only |
filter-relocate-shards | Enables an optimization that mitigates a bias causing shards in RELOCATE mode to end up rebuilding 1/3rd of the data (assuming 3x replication). This setting will cause the server to use the FILTER_RELOCATE_SHARDS flag when triggering rebuilding. It is safe to enable/disable this setting while rebuilding is ongoing as this information is propagated in the event log for a given rebuilding version. Note: this setting does not affect the behavior of rebuildings triggered outside the server by external tools. These tools need to add the FILTER_RELOCATE_SHARDS flag for that purpose. Experimental. | true | server only |
max-node-rebuilding-percentage | Do not initiate rebuilding if more than this percentage of storage nodes in the cluster appear to have been lost or have shards that appear to require rebuilding. | 35 | server only |
planner-scheduling-delay | Delay between a shard rebuilding plan request and its execution to allow many shards to be grouped and planned together. | 1min | server only |
rebuild-dirty-shards | On start-up automatically rebuild LogsDB partitions left dirty by a prior unsafe shutdown of this node. This is called mini-rebuilding. The setting should be on unless you are running with --append-store-durability=sync_write, or don't care about data loss. | true | server only |
rebuild-store-durability | The minimum guaranteed durability of rebuilding writes before a storage node will confirm the STORE as successful. Can be one of "memory", "async_write", or "sync_write". See --append-store-durability for a description of these options. | async_write | server only |
rebuilding-dont-wait-for-flush-callbacks | Regardless of the value of 'rebuild-store-durability', assume any successfully completed store is durable without waiting for flush notifications. NOTE: Use of this setting will lead to silent under-replication when 'rebuild-store-durability' is set to 'MEMORY'. Use for testing and I/O characterization only. | false | requires restart, server only |
rebuilding-global-window | the size of rebuilding global window expressed in units of time. The global rebuilding window is an experimental feature similar to the local window, but tracking rebuilding reads across all storage nodes in the cluster rather than per node. Whereas the local window improves the locality of reads, the global window is expected to improve the locality of rebuilding writes. | max | experimental, server only |
rebuilding-local-window | Rebuilding will try to keep the difference between max and min in-flight records' timestamps less than this value. For best results, this should be considerably larger than rocksdb-partition-duration since records of each partition are rebuilt in a non-chronological order. | 60min | server only |
rebuilding-max-batch-bytes | max amount of data that a node can read in one batch for rebuilding | 10M | server only |
rebuilding-max-get-seq-state-in-flight | maximum number of 'get sequencer state' requests that a rebuilding donor node can have in flight at the same time. Every storage node participating in rebuilding gets the sequencer state for all logs residing on that node before beginning to re-replicate records. This is done in order to determine the LSN at which to stop rebuilding the log. | 100 | server only |
rebuilding-max-record-bytes-in-flight | Maximum total size of rebuilding STORE requests that a rebuilding donor node can have in flight at the same time, per shard. | 100M | server only |
rebuilding-max-records-in-flight | Maximum number of rebuilding STORE requests that a rebuilding donor node can have in flight per shard at the same time. | 200 | server only |
rebuilding-new-to-old | Rebuild records in order of approximately increasing age. More specifically, rebuilding iterates over partitions in reverse order, but within each partition it goes in order of increasing tuple [log ID, LSN]. If set to false, we iterate over partitions in old-to-new order. If global window is enabled, all nodes must have the same value of new-to-old setting; if they don't, rebuilding stalls until they do. | true | requires restart, server only |
rebuilding-planner-sync-seq-retry-interval | retry interval for individual 'get sequencer state' requests issued by rebuilding via SyncSequencerRequest API, with exponential backoff | 60s..5min | server only |
rebuilding-rate-limit | Limit on how fast rebuilding reads, in bytes per unit of time, per shard. Example: 5M/1s will make rebuilding read at most five megabytes per second in each shard. Note that it counts pre-filtering bytes; if rebuilding has high read amplification (e.g. if copyset index is disabled or is not very effective because records are small), much fewer bytes per second will actually get re-replicated. Also note that this setting doesn't affect batch size; e.g. if --rebuilding-max-batch-bytes=10M and --rebuilding-rate-limit=1M/1s, rebuilding will probably read a 10 MB batch every 10 seconds (see the sketch after this table). | unlimited | server only |
rebuilding-restarts-grace-period | Grace period used to throttle how often rebuilding can be restarted. This protects the server against a spike of messages in the event log that would cause a restart. | 20s | server only |
rebuilding-store-timeout | Maximum timeout for attempts by rebuilding to store a record copy or amend a copyset on a specific storage node. This timeout only applies to stores and amends that appear to be in flight; a smaller timeout (--rebuilding-retry-timeout) is used if something is known to be wrong with the store, e.g. we failed to send the message, or we've got an unsuccessful reply, or connection closed after we sent the store. | 240s..480s | server only |
rebuilding-use-rocksdb-cache | Allow rebuilding reads to use RocksDB block cache. Rebuilding reads are not expected to benefit from using the cache, so it's disabled by default to avoid thrashing the cache. | false | server only |
rebuilding-wait-purges-backoff-time | Retry timeout for waiting for local shards to purge a log before rebuilding it. | 1s..10s | experimental, server only |
reject-stores-based-on-copyset | If true, logdevice will prevent writes to nodes that are being drained (rebuilt in RELOCATE mode). Not recommended to set to false unless you're having a production issue. | true | server only |
self-initiated-rebuilding-grace-period | grace period in seconds before triggering full node rebuilding after detecting the node has failed. | 1200s | server only |
shard-is-rebuilt-msg-delay | In large clusters SHARD_IS_REBUILT messages can arrive in a thundering herd, overwhelming thread 0 which processes those messages. The messages will be delayed by a random time in seconds between the min and the max value of this setting's range. 0 means no delay. NOTE: changing this value only applies to upcoming rebuildings; if you want to apply it to ongoing rebuildings, you'll need to restart the node. | 5s..300s | server only |
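
To make the rebuilding-rate-limit note above concrete: the limit caps read bandwidth but not batch size, so the interval between batches is roughly the batch size divided by the rate. A hedged sketch of that arithmetic (hypothetical helper, not LogDevice code):

```python
def seconds_between_rebuilding_batches(max_batch_bytes: int,
                                       rate_limit_bytes: int,
                                       rate_limit_period_s: float = 1.0) -> float:
    """Approximate interval between rebuilding read batches on one shard, given
    --rebuilding-max-batch-bytes and a --rebuilding-rate-limit of bytes per period."""
    bytes_per_second = rate_limit_bytes / rate_limit_period_s
    return max_batch_bytes / bytes_per_second

# --rebuilding-max-batch-bytes=10M with --rebuilding-rate-limit=1M/1s:
# one ~10 MB batch roughly every 10 seconds, matching the description.
print(seconds_between_rebuilding_batches(10 * 2**20, 1 * 2**20))  # 10.0
```
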
Recovery
Name | Description | Default | Notes |
---|---|---|---|
concurrent-log-recoveries | limit on the number of logs that can be in recovery at the same time | 400 | server only |
enable-record-cache | Enable caching of unclean records on storage nodes. Used to minimize local log store access during log recovery. | true | requires restart, server only |
get-erm-for-empty-epoch | If true, Purging will get the EpochRecoveryMetadata even if the epoch is empty locally | true | experimental, server only |
max-active-cached-digests | maximum number of active cached digest streams on a storage node at the same time | 2000 | requires restart, server only |
max-cached-digest-record-queued-kb | amount of RECORD data to push to the client at once for cached digesting | 256 | requires restart, server only |
max-concurrent-purging-for-release-per-shard | max number of concurrently running purging state machines for RELEASE messages per each storage shard for each worker | 4 | requires restart, server only |
mutation-timeout | initial timeout used during the mutation phase of log recovery to store enough copies of a record or a hole plug | 500ms | server only |
record-cache-max-size | Maximum size enforced for the record cache, 0 for unlimited. If positive and record cache size grows more than that, it will start evicting records from the cache. This is also the maximum total number of bytes allowed to be persisted in record cache snapshots. For snapshot limit, this is enforced per-shard with each shard having its own limit of (max_record_cache_snapshot_bytes / num_shards). | 4294967296 | server only |
record-cache-monitor-interval | polling interval for the record cache eviction thread for monitoring the size of the record cache. | 2s | server only |
recovery-grace-period | Grace period time used by epoch recovery after it acquires an authoritative incomplete digest but wants to wait more time for an authoritative complete digest. Millisecond granularity. Can be 0. | 100ms | server only |
recovery-seq-metadata-timeout | Retry backoff timeout used for checking if the latest metadata log record is fully replicated during log recovery. | 2s..60s | server only |
recovery-timeout | epoch recovery timeout. Millisecond granularity. | 120s | server only |
single-empty-erm | A single E:EMPTY response for an epoch is sufficient for GetEpochRecoveryMetadataRequest to consider the epoch as empty if this option is set. | true | experimental, server only |
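
The record-cache-max-size row above notes that the snapshot limit is enforced per shard as max_record_cache_snapshot_bytes / num_shards. A trivial illustration of that split, assuming for the example a 16-shard node (hypothetical helper, not LogDevice code):

```python
def per_shard_snapshot_limit(record_cache_max_size: int, num_shards: int) -> int:
    """Per-shard record cache snapshot budget: the global limit divided by the shard count."""
    return record_cache_max_size // num_shards

# With the default 4294967296 (4 GiB) limit and 16 shards, each shard may persist
# roughly 256 MiB of record cache snapshots.
print(per_shard_snapshot_limit(4294967296, 16))  # 268435456
```
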
Resource management
Name | Description | Default | Notes |
---|---|---|---|
append-stores-max-mem-bytes | Maximum total size of in-flight StoreStorageTasks from appenders and recoveries. Evenly divided among shards. | 2G | server only |
eagerly-allocate-fdtable | enables an optimization to eagerly allocate the kernel fdtable at startup | false | requires restart, server only |
fd-limit | maximum number of file descriptors that the process can allocate (may require root privileges). If equal to zero, do not set any limit. | 0 | requires restart, server only |
flow-groups-run-deadline | Maximum delay (plus one cycle of the event loop) between a request to run FlowGroups and Sender::runFlowGroups() executing. | 5ms | server only |
flow-groups-run-yield-interval | Maximum duration of Sender::runFlowGroups() before yielding to the event loop. | 2ms | server only |
lock-memory | On startup, call mlockall() to lock the text segment (executable code) of logdeviced in RAM. | false | requires restart, server only |
max-inflight-storage-tasks | max number of StorageTask instances that one worker thread may have in flight to each database shard | 4096 | requires restart, server only |
max-payload-size | The maximum payload size that will be accepted by the client library or the server. Can't be larger than 33554432 bytes. | 1048576 | |
max-record-read-execution-time | Maximum execution time for reading records. 'max' means no limit. | 1s | experimental, server only |
max-server-read-streams | max number of read streams clients can establish to the server, per worker | 150000 | server only |
max-total-appenders-size-hard | Total size in bytes of running Appenders across all workers after which we start rejecting new appends. | 629145600 | server only |
max-total-appenders-size-soft | Total size in bytes of running Appenders across all workers after which we start taking measures to reduce the Appender residency time. | 524288000 | server only |
max-total-buffered-append-size | Total size in bytes of payloads buffered in BufferedWriters in sequencers for server-side batching and compression. Appends will be rejected when this threshold is significantly exceeded. | 1073741824 | server only |
num-reserved-fds | expected number of file descriptors to reserve for use by RocksDB files and server-to-server connections within the cluster. This number is subtracted from --fd-limit (if set) to obtain the maximum number of client TCP connections that the server will be willing to accept. | 0 | requires restart, server only |
per-worker-storage-task-queue-size | max number of StorageTask instances to buffer in each Worker for each local log store shard | 1 | requires restart, server only |
queue-drop-overload-time | max time after worker's storage task queue is dropped before it stops being considered overloaded | 1s | server only |
queue-size-overload-percentage | percentage of per-worker-storage-task-queue-size that can be buffered before the queue is considered overloaded | 50 | server only |
read-storage-tasks-max-mem-bytes | Maximum amount of memory that can be allocated by read storage tasks. | 16106127360 | server only |
rebuilding-stores-max-mem-bytes | Maximum total size of in-flight StoreStorageTasks from rebuilding. Evenly divided among shards. | 2G | server only |
rocksdb-low-ioprio | IO priority to request for low-pri rocksdb threads. This works only if the current IO scheduler supports IO priorities. See man ioprio_set for possible values. "any" or "" to keep the default. | requires restart, server only | |
slow-ioprio | IO priority to request for 'slow' storage threads. Storage threads in the 'slow' thread pool handle high-latency RocksDB IO requests, primarily data reads. Not all kernel IO schedulers support IO priorities. See man ioprio_set for possible values. "any" or "" to keep the default. | requires restart, server only | |
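
Per the fd-limit and num-reserved-fds rows above, the number of client TCP connections the server is willing to accept is the fd limit minus the descriptors reserved for RocksDB files and intra-cluster connections. A minimal sketch of that budget (hypothetical helper, not LogDevice code):

```python
from typing import Optional

def max_client_connections(fd_limit: int, num_reserved_fds: int) -> Optional[int]:
    """Approximate ceiling on client TCP connections: --fd-limit minus --num-reserved-fds.
    Returns None when fd-limit is 0, i.e. no limit is configured."""
    if fd_limit == 0:
        return None
    return max(0, fd_limit - num_reserved_fds)

# e.g. --fd-limit=100000 with --num-reserved-fds=20000 leaves room for ~80000 client connections.
print(max_client_connections(100000, 20000))  # 80000
```
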
RocksDB
Name | Description | Default | Notes |
---|---|---|---|
iterator-cache-ttl | expiration time of idle RocksDB iterators in the iterator cache. | 20s | server only |
rocksdb-advise-random-on-open | if true, will hint the underlying file system that the file access pattern is random when an SST file is opened | false | requires restart, server only |
rocksdb-allow-fallocate | If false, fallocate() calls are bypassed in rocksdb | true | requires restart, server only |
rocksdb-arena-block-size | granularity of memtable allocations | 4194304 | requires restart, server only |
rocksdb-block-size | approximate size of the uncompressed data block; rocksdb memory usage for index is around [total data size] / block_size * 50 bytes; on HDD consider using a much bigger value to keep memory usage reasonable | 500K | requires restart, server only |
rocksdb-bloom-bits-per-key | Controls the size of bloom filters in sst files. Set to 0 to disable bloom filters. "Key" in the bloom filter is log ID and entry type (data record, CSI entry or findTime index entry). Iterators then use this information to skip files that don't contain any records of the requested log. The default value of 10 corresponds to a false positive rate of ~1%. Note that LogsDB already skips partitions that don't have the requested logs, so bloom filters only help for somewhat bursty write patterns - when only a subset of files in a partition contain a given log. However, even if appends to a log are steady, sticky copysets may make the streams of STOREs to individual nodes bursty. Another scenario where bloom filters can be effective is during rebuilding. Rebuilding works a few logs at a time, and if the (older partition) memtables are frequently flushed due to memory pressure then they are likely to contain only a small number of logs. | 10 | server only |
rocksdb-bloom-block-based | If true, rocksdb will use a separate bloom filter for each block of sst file. These small bloom filters will be at least 9 bytes each (even if bloom-bits-per-key is smaller). For data records, usually each block contains only one log, so the bloom filter size will be around max(72, bloom_bits_per_key) + 2 * bloom_bits_per_key per log per sst (the "2" corresponds to CSI and findTime index entries; if one or both is disabled, it's correspondingly smaller). | false | server only |
rocksdb-bytes-per-sync | when writing files (except WAL), sync once per this many bytes written. 0 turns off incremental syncing, the whole file will be synced after it's written | 1048576 | requires restart, server only |
rocksdb-bytes-written-since-throttle-eval-trigger | The maximum amount of buffered writes allowed before a forced throttling evaluation is triggered. This helps avoid the condition where too many writes come in for a shard while the flush thread is sleeping and we go over the memory budget. | 20M | server only |
rocksdb-cache-high-pri-pool-ratio | Ratio of rocksdb block cache reserve for index and filter blocks if --rocksdb-cache-index-with-high-priority is enabled, and for small blocks if --rocksdb-cache-small-blocks-with-high-priority is positive. | 0.0 | requires restart, server only |
rocksdb-cache-index | put index and filter blocks in the block cache, allowing them to be evicted | false | requires restart, server only |
rocksdb-cache-index-with-high-priority | Cache index and filter block in high pri pool of block cache, making them less likely to be evicted than data blocks. | false | requires restart, server only |
rocksdb-cache-numshardbits | This setting is not important. Width in bits of the number of shards into which to partition the uncompressed block cache. 0 to disable sharding. -1 to pick automatically. See rocksdb/cache.h. | 4 | requires restart, server only |
rocksdb-cache-size | size of uncompressed RocksDB block cache | 10G | requires restart, server only |
rocksdb-cache-small-block-threshold-for-high-priority | SST blocks smaller than this size will get high priority (see --rocksdb-cache-high-pri-pool-ratio). | 30K | server only |
rocksdb-compaction-access-sequential | suggest to the OS that input files will be accessed sequentially during compaction | true | requires restart, server only |
rocksdb-compaction-max-bytes-at-once | This is the unit for IO scheduling for compaction. It's used only if the DRR scheduler is being used. Each share received from the scheduler allows compaction filtering to proceed with this many bytes. If the scheduler is configured for request-based scheduling (the current default), each principal is allowed a number of requests based on its share, irrespective of the number of bytes processed per request; in that case each request processes this many bytes. | 1048576 | server only |
rocksdb-compaction-ratelimit | limits how fast compactions can read uncompressed data, in bytes; format is | 30M/1s | server only |
rocksdb-compaction-readahead-size | if non-zero, perform reads of this size (in bytes) when doing compaction; big readahead can decrease efficiency of compactions that remove a lot of records (compaction skips trimmed records using seeks) | 4096 | requires restart, server only |
rocksdb-compaction-style | compaction style: 'universal' (default) or 'level'; if using 'level', also set --num-levels to at least 2 | universal | requires restart, server only |
rocksdb-compressed-cache-numshardbits | This setting is not important. Same as --rocksdb-cache-numshardbits but for the compressed cache (if enabled) | 4 | requires restart, server only |
rocksdb-compressed-cache-size | size of compressed RocksDB block cache (0 to turn off) | 0 | requires restart, server only |
rocksdb-compression-type | compression algorithm: 'lz4' (default), 'lz4hc', 'snappy', 'zlib', 'bzip2', 'zstd', 'none' | lz4 | requires restart, server only |
rocksdb-db-write-buffer-size | Soft limit on the total size of memtables per shard; when exceeded, oldest memtables will automatically be flushed. This may soon be superseded by a more global --rocksdb-memtable-size-per-node limit that should be set to <num_shards> * what you'd set this to. If you set this, logdevice will no longer manage any flushes and all responsibility for flushing memtables is taken by rocksdb. | 0 | requires restart, server only |
rocksdb-disable-iterate-upper-bound | disable iterate_upper_bound optimization in RocksDB | false | server only |
rocksdb-enable-insert-hint | Enable rocksdb insert hint optimization. May reduce CPU usage for inserting keys into rocksdb, with small memory overhead. | true | requires restart, server only |
rocksdb-enable-statistics | if set, instruct RocksDB to collect various statistics | true | requires restart, server only |
rocksdb-first-key-in-index | If true, rocksdb sst file index will contain first key of each block. This reduces read amplification but increases index size. | false | requires restart, server only |
rocksdb-flush-block-policy | Controls how RocksDB splits SST file data into blocks. 'default' starts a new block when --rocksdb-block-size is reached. 'each_log', in addition to what 'default' does, starts a new block when log ID changes. 'each_copyset', in addition to what 'each_log' does, starts a new block when copyset changes. Both 'each_' don't start a new block if current block is smaller than --rocksdb-min-block-size. 'each_log' should be safe to use in all cases. 'each_copyset' should only be used when sticky copysets are enabled with --enable-sticky-copysets (otherwise it would start a block for almost every record). | each_log | requires restart, server only |
rocksdb-index-block-restart-interval | Number of keys between restart points for prefix encoding of keys in index blocks. Typically one of two values: 1 for no prefix encoding, 16 for prefix encoding (smaller memory footprint of the index). | 16 | requires restart, server only |
rocksdb-index-shortening | Controls the precision of block boundaries in RocksDB sst file index. More shortening -> smaller indexes (i.e. less memory usage) but potentially worse iterator seek performance. Possible values are: 'none', 'shorten-separators', 'shorten-all'. Unless you're really low on memory, you should probably just use 'none' and not worry about it. There should be no reason to use 'shorten-all' - it saves a negligible amount of memory but makes iterator performance noticeably worse, especially with direct IO or insufficient block cache size. Deciding between 'none' and 'shorten-separators' is not straightforward, so it's probably better to just do it experimentally, by looking at memory usage and disk read rate. Also keep in mind that sst index size is approximately inversely proportional to --rocksdb-block-size or --rocksdb-min-block-size. | none | requires restart, server only |
rocksdb-keep-log-file-num | number of RocksDB log files to retain. A periodic scan process in RocksDB reclaims old files. | 100 | requires restart, server only |
rocksdb-ld-managed-flushes | If set to false (deprecated), decisions about when and which memtables to flush are taken by rocksdb using its internal policy. If set to true, all decisions about flushing are taken by logdevice. It uses the rocksdb-memtable-size-per-node and rocksdb-write-buffer-size settings to decide if it's necessary to flush memtables. Requires rocksdb-memtable-size-per-node to be under 32GB and rocksdb-db-write-buffer-size to be set to zero; otherwise, we silently fall back to rocksdb-managed flushes. | true | server only |
rocksdb-level0-file-num-compaction-trigger | trigger L0 compaction at this many L0 files. This applies to the unpartitioned and metadata column families only, not to LogsDB data partitions. | 10 | requires restart, server only |
rocksdb-level0-slowdown-writes-trigger | start throttling writers at this many L0 files. This applies to the unpartitioned and metadata column families only, not to LogsDB data partitions. | 25 | requires restart, server only |
rocksdb-level0-stop-writes-trigger | stop accepting writes (block writers) at this many L0 files. This applies to the unpartitioned and metadata column families only, not to LogsDB data partitions. | 30 | requires restart, server only |
rocksdb-log-readahead-size | The number of bytes to prefetch when reading the log (including the manifest). This is mostly useful for reading a remotely located log, as it can save the number of round-trips. If 0, then the prefetching is disabled. | 0 | requires restart, server only |
rocksdb-low-pri-write-stall-threshold-percent | A node stalls rebuilding stores when the sum of unflushed memory size and active memory size is above the per-shard memory limit, and the active memory size goes beyond low_pri_write_stall_threshold_percent of the per-shard memory limit. | 5 | server only |
rocksdb-max-background-compactions | Maximum number of concurrent rocksdb-initiated background compactions per shard. Note that this value is not important since most compactions are not "background" as far as rocksdb is concerned. They're done from _logsdb_ thread and are limited to one per shard at a time, regardless of this option. | 2 | requires restart, server only |
rocksdb-max-background-flushes | maximum number of concurrent background memtable flushes per shard. Flushes run on the rocksdb hi-pri thread pool | 1 | requires restart, server only |
rocksdb-max-bytes-for-level-base | maximum combined data size for L1 | 10G | requires restart, server only |
rocksdb-max-bytes-for-level-multiplier | L_n -> L_n+1 data size multiplier | 8 | requires restart, server only |
rocksdb-max-log-file-size | Max size of a RocksDB log file. A new file is created once the limit is reached. 0 for unlimited. | 100M | requires restart, server only |
rocksdb-max-open-files | maximum number of concurrently open RocksDB files; -1 for unlimited | 10000 | requires restart, server only |
rocksdb-max-write-buffer-number | maximum number of concurrent write buffers getting flushed. Rocksdb stalls writes to the column family on reaching this many unflushed memtables. If ld_managed_flushes is true, this setting is ignored and rocksdb is instructed not to stall writes; write throttling is then done by LD based on shard memory consumption rather than the number of memtables pending flush. | 2 | requires restart, server only |
rocksdb-memtable-size-low-watermark-percent | low_watermark_percent is a percentage of memtable_size_per_node and indicates the target consumption to reach if total consumption goes above memtable_size_per_node. Like memtable_size_per_node, low_watermark is sharded and individually applied to every shard. The difference between memtable_size_per_node and low_watermark should roughly match the size of the metadata memtable accumulated while the flusher thread was sleeping. Flushing extra has a big plus: metadata memtable flushes are usually a few hundred KB to a few MB, so if the difference between low_watermark and memtable_size_per_node is on the order of tens of MB, the dependent metadata memtable flushes become almost free. | 60 | server only |
rocksdb-memtable-size-per-node | soft limit on the total size of memtables per node; when exceeded, oldest memtable in the shard whose growth took the total memory usage over the threshold will automatically be flushed. This is a soft limit in the sense that flushing may fall behind or freeing memory be delayed for other reasons, causing us to exceed the limit. --rocksdb-db-write-buffer-size overrides this if it is set, but it will be deprecated eventually. | 10G | experimental, server only |
rocksdb-metadata-block-size | approximate size of the uncompressed data block for metadata column family (if --rocksdb-partitioned); if zero, same as --rocksdb-block-size | 0 | requires restart, server only |
rocksdb-metadata-bloom-bits-per-key | Similar to --rocksdb-bloom-bits-per-key but for metadata column family. You probably don't want to enable this. This option is here just for completeness. It's not expected to have any positive effect since almost all reads from metadata column family bypass bloom filters (with total_order_seek = true). | 0 | server only |
rocksdb-metadata-cache-numshardbits | This setting is not important. Same as --rocksdb-cache-numshardbits but for the metadata cache | 4 | requires restart, server only |
rocksdb-metadata-cache-size | size of uncompressed RocksDB block cache for metadata | 1G | requires restart, server only |
rocksdb-min-block-size | minimum size of the uncompressed data block; only used when --rocksdb-flush-block-policy is not default; on SSD consider reducing this value | 16384 | requires restart, server only |
rocksdb-min-manual-flush-interval | Deprecated after the introduction of ld_managed_flushes. See rocksdb-flush-trigger-check-interval to control the time interval between flush trigger checks. How often a background thread will flush buffered writes if either the data age, partition idle, or data amount triggers indicate a flush should occur. 0 disables all manual flushes. | 120s | server only |
rocksdb-num-bg-threads-hi | Number of high-priority rocksdb background threads to run. These threads are shared among all shards. If -1, num_shards * max_background_flushes is used. | -1 | requires restart, server only |
rocksdb-num-bg-threads-lo | Number of low-priority rocksdb background threads to run. These threads are shared among all shards. If -1, num_shards * max_background_compactions is used. | -1 | requires restart, server only |
rocksdb-num-levels | number of LSM-tree levels if level compaction is used | 1 | requires restart, server only |
rocksdb-paranoid-checks | If true, RocksDB will aggressively check consistency of the data. Also, if any of the writes to the database fails (Put, Delete, Merge, Write), the database will switch to read-only mode and fail all other Write operations. In most cases you want this to be set to true. | true | requires restart, server only |
rocksdb-partition-data-age-flush-trigger | Maximum wait after data are written before being flushed to stable storage. 0 disables the trigger. | 1200s | server only |
rocksdb-partition-idle-flush-trigger | Maximum wait after writes to a time partition cease before any uncommitted data are flushed to stable storage. 0 disables the trigger. | 600s | server only |
rocksdb-read-amp-bytes-per-bit | If greater than 0, will create a bitmap to estimate rocksdb read amplification and expose the result through READ_AMP_ESTIMATE_USEFUL_BYTES and READ_AMP_TOTAL_READ_BYTES stats. | 32 | requires restart, server only |
rocksdb-sample-for-compression | If set, then 1 in N rocksdb blocks will be compressed to estimate the compressibility of data. This is just used for stats collection and is helpful to determine whether compression will be beneficial at the rocksdb level or any other level. Two stat values are updated: sampled_blocks_compressed_bytes_fast and sampled_blocks_compressed_bytes_slow. One is for a fast compression algo like lz4 and the other for a high-compression algo like zstd. The stored data is left uncompressed. 0 means no sampling. | 20 | requires restart, server only |
rocksdb-skip-list-lookahead | number of keys to examine in the neighborhood of the current key when searching within a skiplist (0 to disable the optimization) | 3 | requires restart, server only |
rocksdb-sst-delete-bytes-per-sec | ratelimit in bytes/sec on deletion of SST files per shard; 0 for unlimited. | 100000000 | server only |
rocksdb-table-format-version | Version of the rocksdb block-based sst file format. See rocksdb/table.h for details. You probably don't need to change this. | 4 | requires restart, server only |
rocksdb-target-file-size-base | target L1 file size for compaction | 67108864 | requires restart, server only |
rocksdb-uc-max-merge-width | maximum number of files in a single universal compaction run | 4294967295 | requires restart, server only |
rocksdb-uc-max-size-amplification-percent | target size amplification percentage for universal compaction | 200 | requires restart, server only |
rocksdb-uc-min-merge-width | minimum number of files in a single universal compaction run | 2 | requires restart, server only |
rocksdb-uc-size-ratio | arg is a percentage. If the candidate set size for compaction is arg% smaller than the next file size, then include next file in the candidate set. | 1M | requires restart, server only |
rocksdb-update-stats-on-db-open | load stats from property blocks of several files when opening the database in order to optimize compaction decisions. May significantly impact the time needed to open the db. | false | requires restart, server only |
rocksdb-use-direct-io-for-flush-and-compaction | If true, rocksdb will use O_DIRECT for flushes and compactions (both input and output files). | false | requires restart, server only |
rocksdb-use-direct-reads | If true, rocksdb will use O_DIRECT for most file reads. | false | requires restart, server only |
rocksdb-use-nocachedump-memory-allocator | If true, rocksdb will exclude block cache in core dumps by using the new JemallocNodumpAllocator memory allocator option | false | server only |
rocksdb-wal-bytes-per-sync | when writing WAL, sync once per this many bytes written. 0 turns off incremental syncing | 1M | requires restart, server only |
rocksdb-writable-file-max-buffer-size | Buffer size rocksdb will use when writing files. Applies to sst files, and likely to some metadata files like MANIFEST and OPTIONS. This memory is not allocated all at once, the buffer grows exponentially up to this size; so it's ok for this setting to be too high. | 16M | requires restart, server only |
rocksdb-write-buffer-size | When any RocksDB memtable ('write buffer') reaches this size it is made immutable, then flushed into a newly created L0 file. This setting may soon be superseded by a more dynamic --memtable-size-per-node limit. | 100G | server only |
rocksdb-write-copyset-index | If set, storage nodes will write the copyset index for all records in new partitions created after this is enabled. Note that this won't be used until --rocksdb-use-copyset-index is enabled. | true | requires restart, server only |
write-copyset-index | If set, storage nodes will write the copyset index for all records in new partitions created after this is enabled. Note that this won't be used until --rocksdb-use-copyset-index is enabled. | true | requires restart, server only |
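
Two rules of thumb in the table above lend themselves to quick arithmetic: rocksdb-block-size puts index memory at roughly [total data size] / block_size * 50 bytes, and rocksdb-bloom-block-based quotes a per-log, per-sst filter size of about max(72, bloom_bits_per_key) + 2 * bloom_bits_per_key, presumably in bits given the "at least 9 bytes" floor. A worked sketch under those assumptions (hypothetical helpers, not LogDevice code):

```python
def estimate_index_memory_bytes(total_data_bytes: float, block_size_bytes: float) -> float:
    """Rule of thumb from rocksdb-block-size: index memory ~= total data / block_size * 50 bytes."""
    return total_data_bytes / block_size_bytes * 50

def estimate_block_based_bloom_bits(bloom_bits_per_key: int = 10) -> int:
    """Rule of thumb from rocksdb-bloom-block-based: per log per sst, roughly
    max(72, bloom_bits_per_key) + 2 * bloom_bits_per_key bits (data block filter
    plus CSI and findTime index entries)."""
    return max(72, bloom_bits_per_key) + 2 * bloom_bits_per_key

# 10 TB of data with the default 500 KB block size needs on the order of 1 GB of index memory;
# a bigger block size shrinks the index proportionally (useful on HDD, per the description).
print(estimate_index_memory_bytes(10e12, 500e3))  # 1e9 bytes
print(estimate_block_based_bloom_bits())          # 92 bits per log per sst
```
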
Security
Name | Description | Default | Notes |
---|---|---|---|
require-permission-message-types | Check permissions only for the received message of the type(s) specified. Separate different types with a comma. 'all' to apply to all messages. Prefix the value with '~' to include all types except the given ones, e.g. '~WINDOW,RELEASE' will check permissions for messages of all types except WINDOW and RELEASE. | START | server only |
require-ssl-on-command-port | Requires SSL for admin commands sent to the command port. --ssl-cert-path, --ssl-key-path and --ssl-ca-path settings must be properly configured | false | experimental, server only |
ssl-boundary | Enable SSL in cross-X traffic, where X is the setting. Example: if set to "rack", all cross-rack traffic will be sent over SSL. Can be one of "none", "node", "rack", "row", "cluster", "data_center" or "region". If a value other than "none" or "node" is specified on the client, --my-location has to be specified as well. | none | |
ssl-ca-path | Path to CA certificate. | requires restart | |
ssl-cert-path | Path to LogDevice SSL certificate. | requires restart | |
ssl-cert-refresh-interval | TTL for an SSL certificate that we have loaded from disk. | 300s | requires restart |
ssl-key-path | Path to LogDevice SSL key. | requires restart | |
ssl-load-client-cert | Set to include client certificate for mutual ssl authentication | false | requires restart, client only |
ssl-on-gossip-port | If true, gossip port will reject all plaintext connections. Only SSL connections will be accepted. WARNING: Any change to this setting should only be performed while send-to-gossip-port = false, in order to avoid failure detection issues while the setting change propagates through the cluster. | false | server only |
ssl-use-session-resumption | If enabled, new SSL connections will attempt to resume previously cached sessions. | false | |
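
ssl-boundary above enables SSL for traffic that crosses the configured location scope: with ssl-boundary=rack, peers in different racks talk over SSL, while peers sharing a rack need not. A conceptual sketch of that decision, assuming locations are represented as dictionaries keyed by scope name (illustrative only, not LogDevice code):

```python
# Location scopes ordered from narrowest to widest, mirroring the --ssl-boundary values
# other than "none" and "node" ("node" simply means every cross-node connection uses SSL).
SCOPES = ["rack", "row", "cluster", "data_center", "region"]

def needs_ssl(boundary: str, my_location: dict, peer_location: dict) -> bool:
    """Return True if traffic between the two locations crosses the configured boundary."""
    if boundary == "none":
        return False
    if boundary == "node":
        return True
    # The boundary is crossed if the peers differ at that scope or at any wider scope.
    for scope in SCOPES[SCOPES.index(boundary):]:
        if my_location.get(scope) != peer_location.get(scope):
            return True
    return False

a = {"region": "us-east", "data_center": "dc1", "cluster": "c1", "row": "r1", "rack": "k1"}
b = dict(a, rack="k2")
print(needs_ssl("rack", a, b))  # True: different racks, so this traffic goes over SSL
print(needs_ssl("row", a, b))   # False: same row
```
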
Sequencer State
Name | Description | Default | Notes |
---|---|---|---|
check-seal-req-min-timeout | before a sequencer returns its state in response to a 'get sequencer state' request the sequencer checks that it is the most recent (highest numbered) sequencer for the log. It performs the check by sending a 'check seal' request to a valid copyset of nodes in the nodeset of the sequencer's epoch. The 'check seal' request looks for a seal record placed by a higher-numbered sequencer. This setting sets the timeout for 'check seal' requests. The timeout is set to the smaller of this value and half the value of --seq-state-reply-timeout. | 500ms | server only |
enable-health-based-sequencer-placement | Toggle use of HealthMonitor determined node status in sequencer location. This value MUST be the same on client and server to ensure correct conditions for health based sequencer placement. | false | |
epoch-draining-timeout | Maximum time allowed for sequencer to drain one epoch. Sequencer will abort draining the epoch if it takes longer than the timeout. A sequencer 'drains' its epoch (waits for all appenders to complete) while reactivating to serve a higher epoch. | 2s | server only |
get-trimpoint-interval | polling interval for the sequencer getting the trim point from all storage nodes | 600s | server only |
maximum-percent-unhealthy-seq-nodes-hbsp | Percent of UNHEALTHY nodes in the cluster at which HealthBasedHashing is no longer a viable option. This value MUST be the same on client and server to ensure correct conditions for health based sequencer placement. | 0.5 | |
metadata-log-trim-interval | How often periodic trimming of metadata logs should run. Zero value prevents it from running at all. | 7200s | server only |
metadata-log-trim-timeout | Timeout when waiting for periodic metadata log trim request to be completed | 120s | server only |
nodeset-adjustment-min-window | When automatic nodeset size adjustment is enabled, only do the adjustment if we've got append throughput information for at least this period of time. More details: we choose nodeset size based on log's average append throughput in a moving window of size --nodeset-adjustment-period. The average is maintained by the sequencer. If the sequencer was activated recently, we may not have a good estimate of log's append throughput. This setting says how long to wait after sequencer activation before allowing adjusting nodeset size based on that sequencer's throughput. | 1h | server only |
nodeset-adjustment-period | If not zero, nodeset size for each log will be periodically adjusted based on the log's measured throughput. This setting controls how often such adjustments will be considered. The nodeset size is chosen proportionally to throughput, replication factor and backlog duration. The nodeset_size log attribute acts as the minimum allowed nodeset size, used for low-throughput logs and logs with infinite backlog duration. If --nodeset-adjustment-period is changed from nonzero to zero, all adjusted nodesets get immediately updated back to normal size. | 6h | server only |
nodeset-adjustment-target-bytes-per-shard | When automatic nodeset size adjustment is enabled, (--nodeset-adjustment-period), this setting controls the size of the chosen nodesets. The size is chosen so that each log takes around this much space on each shard. More precisely, nodeset\_size = append\_bytes\_per\_sec * backlog\_duration * replication\_factor / nodeset\_adjustment\_target\_bytes\_per\_shard . Appropriate value for this setting is around 0.1% - 1% of disk size. | 10G | server only |
nodeset-max-randomizations | When automatic nodeset size adjustment wants to enlarge nodeset to unreasonably big size N > 127, we instead set nodeset size to 127 but re-randomize the nodeset min(N/127, nodeset_max_randomizations) times during retention period. If you make it too big, the union of historical nodesets will get big (127 * n), and findTime, isLogEmpty etc may become expensive. If you set it too small, and the cluster has high-throughput high-retention logs, space usage may be not very balanced. | 4 | server only |
nodeset-size-adjustment-min-factor | When automatic nodeset size adjustment is enabled, we skip adjustments that are smaller than this factor. E.g. if this setting is set to 2, we won't bother updating nodeset if its size would increase or decrease by less than a factor of 2. If set to 0, nodesets will be unconditionally updated every --nodeset-adjustment-period, and will also be randomized each time, as opposed to using consistent hashing. | 2 | server only |
reactivation-limit | Maximum allowed rate of sequencer reactivations. When exceeded, further appends will fail. | 5/1s | requires restart, server only |
read-historical-metadata-timeout | maximum time interval for a sequencer to get historical epoch metadata through reading the metadata log before retrying. | 10s | server only |
seq-state-backoff-time | how long to wait before resending a 'get sequencer state' request after a timeout. | 1s..10s | |
seq-state-reply-timeout | how long to wait for a reply to a 'get sequencer state' request before retrying (usually to a different node) | 2s | |
update-metadata-map-interval | Sequencer has a timer for periodically reading metadata logs and refreshing the in memory metadata_map_. This setting specifies the interval for this timer | 1h | server only |
write-streams-map-clear-size | Clear size for write streams map in each epoch sequencer. Once size exceeds write-streams-map-max-capacity, write-streams-map-clear-size number of least-recently-used write streams are evicted. | 100 | server only |
write-streams-map-max-capacity | Maximum capacity of write streams map in each epoch sequencer. Once size exceeds write-streams-map-max-capacity, write-streams-map-clear-size number of least-recently-used write streams are evicted. | 1000 | server only |
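
The nodeset-adjustment-target-bytes-per-shard row above states the sizing formula explicitly. A quick worked example with illustrative numbers (hypothetical helper, not LogDevice code), also applying the nodeset_size log attribute as the minimum, per the nodeset-adjustment-period description:

```python
def target_nodeset_size(append_bytes_per_sec: float,
                        backlog_duration_sec: float,
                        replication_factor: int,
                        target_bytes_per_shard: float,
                        min_nodeset_size: int) -> int:
    """nodeset_size = append_bytes_per_sec * backlog_duration * replication_factor
                      / nodeset_adjustment_target_bytes_per_shard,
    never smaller than the log's nodeset_size attribute."""
    size = (append_bytes_per_sec * backlog_duration_sec * replication_factor
            / target_bytes_per_shard)
    return max(min_nodeset_size, round(size))

# A log appending 2 MB/s with a 7-day backlog and 3x replication, against the default
# 10 GB target per shard, would want a nodeset of ~363 shards (in practice capped at 127
# and re-randomized, per nodeset-max-randomizations).
print(target_nodeset_size(2e6, 7 * 86400, 3, 10e9, min_nodeset_size=20))  # 363
```
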
Sequencer boycotting
Name | Description | Default | Notes |
---|---|---|---|
node-stats-boycott-adaptive-duration-decrease-rate | (experimental) the additive decrease rate of the adaptive boycotting duration | 1min | experimental, server only |
node-stats-boycott-adaptive-duration-decrease-time-step | (experimental) the time step of the decrease of the adaptive boycotting duration | 30s | experimental, server only |
node-stats-boycott-adaptive-duration-increase-factor | (experimental) the multiplicative increase factor of the adaptive boycotting duration | 2 | experimental, server only |
node-stats-boycott-duration | How long a boycott should be active for. 0 ensures that boycotts have no effect, but controller nodes will still report outliers | 30min | server only |
node-stats-boycott-grace-period | If a node is consecutively deemed an outlier for this amount of time, allow it to be boycotted | 300s | server only |
node-stats-boycott-max-adaptive-duration | (experimental) The maximum adaptive boycotting duration | 24h | experimental, server only |
node-stats-boycott-min-adaptive-duration | (experimental) The minimum (and default) adaptive boycotting duration | 30min | experimental, server only |
node-stats-boycott-relative-margin | If this is set to 0.05, a node's append success ratio has to be 5% smaller than the average success ratio of all nodes in the cluster. While node-stats-boycott-sensitivity is an absolute threshold, this setting defines a sensitivity threshold relative to the average of all success ratios. | 0.15 | server only |
node-stats-boycott-required-client-count | Require values from at least this many clients before a boycott may occur | 1 | server only |
node-stats-boycott-required-std-from-mean | A node has to have a success ratio lower than (mean - X * STD) to be considered an outlier. X being the value of node-stats-boycott-required-std-from-mean | 3 | server only |
node-stats-boycott-sensitivity | If node-stats-boycott-sensitivity is set to e.g. 0.05, then nodes with a success ratio at or above 95% will not be boycotted | 0 | server only |
node-stats-boycott-use-adaptive-duration | (experimental) Use the new adaptive boycotting durations instead of the fixed one | false | experimental, server only |
node-stats-controller-aggregation-period | The period at which the controller nodes requests stats from all nodes in the cluster. Should be smaller than node-stats-retention-on-nodes | 30s | server only |
node-stats-controller-check-period | A node will check if it's a controller or not with the given period | 60s | server only |
node-stats-controller-response-timeout | A controller node waits this long between requesting stats from the other nodes, and aggregating the received stats | 2s | server only |
node-stats-max-boycott-count | How many nodes may be boycotted. 0 not only prevents any nodes from being boycotted, it also ensures that no nodes become controller nodes | 0 | server only |
node-stats-remove-worst-percentage | Will throw away the worst X% of values reported by clients, to a maximum of node-count * node-stats-send-worst-client-count | 0.2 | server only |
node-stats-retention-on-nodes | Save node stats sent from the clients on the nodes for this duration | 300s | requires restart, server only |
node-stats-send-period | Send per-node stats into the cluster with this period. Currently only 30s of stats is tracked on the clients, so a value above 30s will not have any effect. | 15s | client only |
node-stats-send-retry-delay | When sending per-node stats into the cluster, and the message failed, wait this much before retrying. | 5ms..1s | requires restart, client only |
node-stats-send-worst-client-count | Once a node has aggregated the values sent from writers, there may be some number of writers that are in a bad state and report 'false' values. By setting this value, the node-stats-send-worst-client-count worst values reported by clients per node will be sent separately to the controller, which can then decide whether the writer is functioning correctly or not. | 20 | server only |
node-stats-timeout-delay | Wait this long for an acknowledgement that the sent node stats message was received before sending the stats to another node | 2s | client only |
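
The boycott rows above combine several thresholds on a node's append success ratio: an absolute sensitivity (node-stats-boycott-sensitivity), a margin relative to the cluster average (node-stats-boycott-relative-margin), and a distance below the mean in standard deviations (node-stats-boycott-required-std-from-mean). The sketch below is a hedged approximation of how a controller might combine them; the real logic also involves grace periods, required client counts, and the boycott budget.

```python
from statistics import mean, pstdev

def is_append_outlier(node_ratio: float,
                      all_ratios: list,
                      sensitivity: float = 0.0,
                      relative_margin: float = 0.15,
                      required_std_from_mean: float = 3.0) -> bool:
    """True if a node's append success ratio is low enough to be treated as an outlier."""
    if sensitivity and node_ratio >= 1.0 - sensitivity:
        return False  # e.g. sensitivity 0.05: nodes at or above 95% are never boycotted
    avg, std = mean(all_ratios), pstdev(all_ratios)
    below_relative_margin = node_ratio < avg * (1.0 - relative_margin)
    below_std_threshold = node_ratio < avg - required_std_from_mean * std
    return below_relative_margin and below_std_threshold

ratios = [0.99] * 19 + [0.60]           # one node clearly failing appends
print(is_append_outlier(0.60, ratios))  # True
print(is_append_outlier(0.99, ratios))  # False
```
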
State machine execution
Name | Description | Default | Notes |
---|---|---|---|
background-queue-size | Maximum number of events we can queue to background thread. A single queue is shared by all threads in a process. | 100000 | requires restart |
execute-requests | number of HI_PRI requests to process per worker event loop iteration | 13 | |
lo_requests_per_iteration | number of LO_PRI requests to process per worker event loop iteration | 1 | |
mid_requests_per_iteration | number of MID_PRI requests to process per worker event loop iteration | 2 | |
num-background-workers | The number of workers dedicated for processing time-insensitive requests and operations | 4 | requires restart, server only |
num-processor-background-threads | Number of threads in Processor's background thread pool. Background threads are used by, e.g., BufferedWriter to construct/compress large batches. If 0 (default), use num-workers. | 0 | requires restart |
num-workers | number of worker threads to run, or "cores" for one thread per CPU core | cores | requires restart |
prioritized-task-execution | Enable prioritized execution of requests within CPU executor. Setting this false ignores per request and per message ExecutorPriority. | true | requires restart |
test-mode | Enable functionality in integration tests. Currently used for admin commands that are only enabled for testing purposes. | false | CLI only, requires restart, server only |
worker-request-pipe-capacity | size each worker request queue to hold this many requests | 524288 | requires restart |
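
execute-requests, mid_requests_per_iteration, and lo_requests_per_iteration above cap how many requests of each priority a worker drains per event loop iteration (13/2/1 by default). A conceptual sketch of one iteration under those caps (illustrative only, not the actual worker implementation):

```python
from collections import deque

def drain_one_iteration(hi: deque, mid: deque, lo: deque,
                        hi_per_iter: int = 13, mid_per_iter: int = 2, lo_per_iter: int = 1):
    """Pop up to the per-priority budget from each request queue for one event loop iteration."""
    executed = []
    for queue, budget in ((hi, hi_per_iter), (mid, mid_per_iter), (lo, lo_per_iter)):
        for _ in range(min(budget, len(queue))):
            executed.append(queue.popleft())
    return executed

hi = deque(f"hi{i}" for i in range(20))
mid = deque(f"mid{i}" for i in range(5))
lo = deque(f"lo{i}" for i in range(5))
# One iteration executes 13 HI_PRI, 2 MID_PRI and 1 LO_PRI request with the defaults;
# the rest wait for the next iteration.
print(drain_one_iteration(hi, mid, lo))
```
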
Storage
Name | Description | Default | Notes |
---|---|---|---|
free-disk-space-threshold | threshold (relative to total diskspace) of minimal free disk space for storage partitions to accept writes. This should be a fraction between 0.0 (exclusive) and 1.0 (exclusive). Storage nodes will reject writes to partitions with free disk space less than the threshold in order to guarantee that compactions can be performed. Note: this only applies to RocksDB local storage. | 0.2 | server only |
ignore-cluster-marker | If cluster marker is missing or doesn't match, overwrite it and carry on. Cluster marker is a file that LogsDB writes in the data directory of each shard to identify the shard id, node id, and cluster name to which the data in that directory belongs. Cluster marker mismatch indicates that a drive or node was moved to another cluster or another shard, and the data must not be used. | false | requires restart, server only |
local-log-store-path | path to local log store (if storage node) | requires restart, server only | |
logstore-monitoring-interval | interval between consecutive health checks on the local log store | 10s | server only |
max-in-flight-monitor-requests | maximum number of in-flight monitoring requests (e.g. manual compaction) posted by the monitoring thread | 1 | requires restart, server only |
max-queued-monitor-requests | max number of log store monitor requests buffered in the monitoring thread queue | 32 | requires restart, server only |
rocksdb-auto-create-shards | Auto-create shard data directories if they do not exist | false | requires restart, server only |
storage-task-read-backlog-share | The share for principal read-backlog in the DRR scheduler. | 5 | server only |
storage-task-read-compaction-partial-share | The share for principal read-compaction-partial in the DRR scheduler. | 1 | server only |
storage-task-read-compaction-retention-share | The share for principal read-compaction-retention in the DRR scheduler. | 3 | server only |
storage-task-read-findkey-share | The share for principal read-findkey in the DRR scheduler. | 10 | server only |
storage-task-read-internal-share | The share for principal read-internal in the DRR scheduler. | 10 | server only |
storage-task-read-metadata-share | The share for principal read-metadata in the DRR scheduler. | 10 | server only |
storage-task-read-misc-share | The share for principal read-misc in the DRR scheduler. | 1 | server only |
storage-task-read-rebuild-share | The share for principal read-rebuild in the DRR scheduler. | 3 | server only |
storage-task-read-tail-share | The share for principal read-tail in the DRR scheduler. | 8 | server only |
storage-tasks-drr-quanta | Default quanta per-principal. 1 implies request based scheduling. Use something like 1MB for byte based scheduling. | 1 | server only |
storage-tasks-use-drr | Use DRR for scheduling read IOs. | false | requires restart, server only |
storage-thread-delaying-sync-interval | Interval between invoking syncs for delayable storage tasks. Ignored when undelayable task is being enqueued. | 100ms | server only |
storage-threads-per-shard-default | size of the storage thread pool for small client requests and metadata operations, per shard. If zero, the 'slow' pool will be used for such tasks. | 2 | requires restart, server only |
storage-threads-per-shard-fast | size of the 'fast' storage thread pool, per shard. This storage thread pool executes storage tasks that write into RocksDB. Such tasks normally do not block on IO. If zero, slow threads will handle write tasks. | 2 | requires restart, server only |
storage-threads-per-shard-fast-stallable | size of the thread pool (per shard) executing low priority write tasks, such as writing rebuilding records into RocksDB. Measures are taken to not schedule low-priority writes on this thread pool when there is work for 'fast' threads. If zero, normal fast threads will handle low-pri write tasks | 1 | requires restart, server only |
storage-threads-per-shard-slow | size of the 'slow' storage thread pool, per shard. This storage thread pool executes storage tasks that read log records from RocksDB, both to serve read requests from clients, and for rebuilding. Those are likely to block on IO. | 2 | requires restart, server only |
write-batch-bytes | min number of payload bytes for a storage thread to write in one batch unless write-batch-size is reached first | 1048576 | server only |
write-batch-size | max number of records for a storage thread to write in one batch (see the batching sketch after this table) | 1024 | server only |
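write-batch-bytes and write-batch-size jointly bound how much a storage thread writes in one pass: it keeps pulling pending writes until it has accumulated the minimum payload bytes or hits the record cap, whichever comes first. The following is a minimal sketch of that rule, assuming a simple in-memory queue of pending writes; `PendingWrite` and `take_batch` are illustrative names only, not LogDevice code.

```python
from collections import deque
from dataclasses import dataclass

WRITE_BATCH_SIZE = 1024        # write-batch-size: max records per batch
WRITE_BATCH_BYTES = 1048576    # write-batch-bytes: min payload bytes per batch

@dataclass
class PendingWrite:
    payload: bytes

def take_batch(queue):
    """Pop writes until the record cap is hit or enough payload bytes are gathered."""
    batch = []
    batch_bytes = 0
    while queue and len(batch) < WRITE_BATCH_SIZE and batch_bytes < WRITE_BATCH_BYTES:
        write = queue.popleft()
        batch.append(write)
        batch_bytes += len(write.payload)
    return batch

# Example: 2000 small writes are split into batches of at most 1024 records,
# because the record cap is reached long before the byte minimum.
pending = deque(PendingWrite(b"x" * 100) for _ in range(2000))
print(len(take_batch(pending)))  # -> 1024
```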
Testing
Name | Description | Default | Notes |
---|---|---|---|
abort-on-failed-catch | When an ld_catch() fails, call abort(). If not, just continue executing. We'll log either way. | true in debug builds, false in release builds | |
abort-on-failed-check | When an ld_check() fails, call abort(). If not, just continue executing. We'll log either way. | false in the client, true elsewhere | |
assert-on-data | Trigger asserts on data in RocksDB (or on data received from the network). Should not be used in prod. | false | server only |
block-eventlog-rsm | If true, the EventLog replicated state machine will not publish any state updates. This simulates the case where we cannot finish loading the state on startup. Changing the value will cause the RSM to publish the state immediately if it can. | false | server only |
block-logsconfig-rsm | If true, the LogsConfig replicated state machine will not publish any state updates. This simulates the case where we cannot finish loading the state on startup. Changing the value will cause the RSM to publish the state immediately if it can. | false | |
block-maintenance-rsm | If true, the Maintenance replicated state machine will not publish any state updates. This simulates the case where we cannot finish loading the state on startup. Changing the value will cause the RSM to publish the state immediately if it can. | false | server only |
client-test-force-stats | force instantiation of StatsHolder within ClientImpl even if stats publishing is disabled | false | requires restart, client only |
disable-event-log-trimming | Disable trimming of the event log (for tests only) | false | server only |
disable-logsconfig-trimming | Disable the trimming of logsconfig delta log. Used for testing only. | false | server only |
disable-rebuilding | Disable rebuilding. Do not use in production. Only used by tests. | false | requires restart, server only |
disable-trim-past-tail-check | Disable check for trim past tail. Used for testing log trimming. | false | client only |
dont-serve-findtimes-for-logs | Logs for which findtimes will not be served | server only | |
dont-serve-findtimes-status | status that should be returned for logs that are in "dont-serve-findtimes-for-logs" | FAILED | server only |
dont-serve-reads-for-logs | Logs for which reads will not be served | server only | |
dont-serve-reads-status | status that should be returned for logs that are in "dont-serve-reads-for-logs" | FAILED | server only |
dont-serve-stores-for-logs | Logs for which stores will not be served | server only | |
dont-serve-stores-status | status that should be returned for logs that are in "dont-serve-stores-for-logs" | FAILED | server only |
epoch-store-path | directory containing epoch files for logs (for testing only) | requires restart, server only | |
esn-bits | How many bits to use for sequence numbers within an epoch. LSN bits [n, 32) are guaranteed to be 0. Used for testing ESN exhaustion. | 32 | requires restart, server only |
hold-store-replies | If set, we hold all STORED messages (which are replies to STORE messages) until the last one comes in. Has some race conditions and other downsides, so only use in tests. Used to ensure that all storage nodes have had a chance to process the STORE messages, even if one returns PREEMPTED or another error condition. | false | requires restart, server only |
include-cluster-name-on-handshake | The cluster name of the connection initiator will be included in the LogDevice protocol handshake. If the cluster name of the initiator does not match the actual cluster name of the destination, the connection is terminated. We don't know of any good reasons to disable this option. If you disable it and move some hosts from one cluster to another, you may have a bad time: some clients or servers may not pick up the update and keep talking to the hosts as if they weren't moved; this may corrupt metadata. Used for testing and internally created connections only. | true | |
loglevel-overrides | Comma-separated list of per-module log level overrides, each in the form module:loglevel | server only | |
logsconfig-api-blacklist-nodes | Comma-separated list of indices of nodes that shouldn't be picked for executing logs config API requests (e.g. makeDirectory()). Used in tests. | client only | |
msg-error-injection-chance | percentage chance of a forced message error on a Socket. Used to exercise error handling paths. | 0 | requires restart, server only |
msg-error-injection-status | Status that should be returned for a simulated message transmission error. Some values are treated in special ways: CBREGISTERED pretends that the message was delayed by traffic shaping (only if traffic shaping is enabled); DROPPED makes the Sender pretend that the message is in-flight indefinitely, without ever sending it. | NOBUFS | requires restart |
nodes-configuration-file-store-dir | If set, the source of truth of nodes configuration will be under this dir instead of the default (zookeeper) store. Only effective when --enable-nodes-configuration-manager=true; Used by integration testing. | requires restart | |
rebuild-without-amends | During rebuilding, send a normal STORE rather than a STORE with the AMEND flag, when updating the copyset of nodes that already have a copy of the record. This option is used by integration tests to fully divorce append content from records touched by rebuilding. | false | server only |
rebuilding-read-only | Rebuilding is read-only (for testing only). Use on-donor if rebuilding should not send STORE messages, or on-recipient if these should be sent but discarded by the recipient (LogsDB only) | none | server only |
rebuilding-retry-timeout | Delay before a record rebuilding retries a failed operation | 5s..30s | server only |
rocksdb-test-corrupt-stores | Used for testing only. If true, a node will report all stores it receives as corrupted. | false | server only |
rocksdb-test-stall-sst-reads | Used for testing. If set to 'true', rocksdb::RandomAccessFile::Read() calls will stall indefinitely, until the setting is changed to 'false'. This simulates a file read getting stuck. Can be used for testing resilience of read path to such situation. | false | server only |
skip-recovery | Skip recovery. For tests only. When this option is enabled, recovery does not recover any data but instead immediately marks all epochs as clean in the epoch store and purging immediately marks all epochs as clean in the local log store. This feature should be used as a last resort if a cluster's availability is hurt by recovery and it is important to quickly restore availability at the cost of some inconsistencies. On-the-fly changes of this setting will only apply to new LogRecoveryRequests and will not affect recoveries that are already in progress. | false | server only |
test-appender-skip-stores | Allow appenders to skip sending data to storage node. Currently used in tests to make sure an appender state machine is running | false | server only |
test-bypass-recovery | If set, sequencers will not automatically run recovery upon activation. Recovery can be started using the 'startrecovery' admin command. Note that last released lsn won't get advanced without recovery. | false | requires restart, server only |
test-do-not-pick-in-copysets | Copyset selectors won't pick these nodes. Comma-separated list of node indexes, e.g. '1,2,3'. Used in tests. | server only | |
test-get-cluster-state-recipients | Force get-cluster-state recipients as a comma-separated list of node ids | client only | |
test-reject-hello | if set to the name of an error code, reject all HELLOs with the specified error code. Currently supported values are ACCESS and PROTONOSUPPORT. Used for testing. | OK | requires restart, server only |
test-same-partition-nodes | Used for isolation testing. Only nodes in this set will be addressable from this node. An empty list disables this error injection. | server only | |
test-sequencer-corrupt-stores | Simulates bad hardware flipping a bit in the payload of a STORE message. | false | server only |
test-stall-rebuilding | Makes rebuilding pretend to start but not make any actual progress. Used in tests. | false | server only |
test-timestamp-linear-transform | Coefficients for transforming the timestamp of records for tests. The value should contain two integers separated by ',', for example 'm,c'. Record timestamps are transformed as m * now() + c. The default value of '1,0' makes the timestamp = now(), which is expected for all the normal use cases (see the sketch after this table). | 1,0 | requires restart, server only |
unix-socket | Path to the unix domain socket the server will use to listen for non-SSL clients | CLI only, requires restart, server only | |
watchdog-detected-worker-stall-error-injection-chance | Percentage chance of detection of stalled workers in watchdog thread. Used to exercise error handling paths in Health Monitor. | 0 | requires restart, server only |
worker-queue-stall-error-injection-chance | Percentage chance of delayed request queuing in a worker thread. Used to exercise error handling paths in Health Monitor. | 0 | requires restart, server only |
worker-stall-error-injection-chance | Percentage chance of delayed request execution in a worker thread. Used to exercise error handling paths in Health Monitor. | 0 | requires restart, server only |
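As a concrete illustration of one of these testing knobs, test-timestamp-linear-transform encodes a pair 'm,c' and rewrites record timestamps as m * now() + c. The sketch below parses and applies such a pair; `apply_timestamp_transform` is an assumed helper name, and millisecond units are an assumption, not something stated in the table.

```python
def parse_transform(setting: str):
    """Parse an 'm,c' pair as used by test-timestamp-linear-transform."""
    m_str, c_str = setting.split(",")
    return int(m_str), int(c_str)

def apply_timestamp_transform(setting: str, now_ms: int) -> int:
    """Return m * now_ms + c; the default '1,0' leaves the timestamp unchanged."""
    m, c = parse_transform(setting)
    return m * now_ms + c

# With the default '1,0' the transformed timestamp equals now().
assert apply_timestamp_transform("1,0", now_ms=1_700_000_000_000) == 1_700_000_000_000
# '2,5' would double the clock and shift it by 5 ms -- only useful in tests.
assert apply_timestamp_transform("2,5", now_ms=10) == 25
```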
Uncategorized
Name | Description | Default | Notes |
---|---|---|---|
rebuilding-max-batch-time | Max amount of time rebuilding read storage task is allowed to take before yielding to other storage tasks. "max" for unlimited. | 1000ms | server only |
rebuilding-max-malformed-records-to-tolerate | Controls how rebuilding donors handle unexpected values in the local log store (e.g. caused by bugs, forward incompatibility, or other processes writing unexpected things to rocksdb directly). If rebuilding encounters invalid records, it skips them and logs warnings. But if it encounters at least this many of them, it freaks out, logs a critical error and stalls indefinitely (see the sketch after this table). The rest of the server keeps trying to run normally, to the extent to which you can run normally when you can't parse most of the records in the DB. | 1000 | requires restart, server only |
sync-metadata-log-writes | If set, storage nodes will wait for wal sync of metadata log writes before sending the STORED ack. | true | server only |
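rebuilding-max-malformed-records-to-tolerate is effectively a warn-and-skip counter with a hard stop. A minimal sketch of that behaviour is shown below, assuming hypothetical `records` and `is_malformed` inputs; it models the described policy, not LogDevice's actual rebuilding code.

```python
import logging

REBUILDING_MAX_MALFORMED_RECORDS_TO_TOLERATE = 1000

def rebuild_shard(records, is_malformed) -> int:
    """Skip malformed records, but stop making progress once too many are seen.

    Mirrors the described behaviour: warn-and-skip for occasional bad records,
    escalate to a critical error and stall once the tolerance is reached.
    """
    malformed = 0
    processed = 0
    for record in records:
        if is_malformed(record):
            malformed += 1
            logging.warning("skipping malformed record: %r", record)
            if malformed >= REBUILDING_MAX_MALFORMED_RECORDS_TO_TOLERATE:
                logging.critical("too many malformed records (%d); stalling rebuilding",
                                 malformed)
                break  # the real server stalls indefinitely rather than exiting
            continue
        processed += 1  # a real donor would re-replicate the record here
    return processed
```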
Write path
Name | Description | Default | Notes |
---|---|---|---|
append-store-durability | The minimum guaranteed durability of record copies before a storage node confirms the STORE as successful. Can be one of "memory" if record is to be stored in a RocksDB memtable only (logdeviced memory), "async_write" if record is to be additionally written to the RocksDB WAL file (kernel memory, frequently synced to disk), or "sync_write" if the record is to be written to the memtable and WAL, and the STORE acknowledged only after the WAL is synced to disk by a separate WAL syncing thread using fdatasync(3). | async_write | server only |
appender-buffer-process-batch | batch size for processing per-log queue of pending writes | 20 | server only |
appender-buffer-queue-cap | capacity of per-log queue of pending writes while sequencer is initializing or activating | 10000 | requires restart, server only |
byte-offsets | Enables the server-side byte offset calculation feature. NOTE: There is no guarantee that byte offset results are correct if the feature was switched on->off->on in a period shorter than the retention value for logs. | false | server only |
check-node-health-request-timeout | Timeout for health check probes that sequencers send to unresponsive storage nodes. If no reply arrives after timeout, another probe is sent. | 120s | server only |
checksum-bits | how big a checksum to include with newly appended records (0, 32 or 64) | 32 | |
copyset-locality-min-scope | Tl;dr: if you experience data distribution imbalance caused by hot logs, and you have plenty of unused cross-rack/cross-region bandwidth, try changing this setting to "root"; otherwise the default "rack" is just fine. More details: let X be the value of this setting, and let Y be the biggest scope in log's replicateAcross property; if Y < X, nothing happens; if Y >= X, at least one copy of each record will be stored in sequencer's domain of scope Y (not X), when it's possible without affecting average data distribution. This, combined with chain-sending, typically reduces the number of cross-Y hops by one per record. | rack | server only |
disable-chain-sending | never send a wave of STORE messages through a chain | false | server only |
disable-graylisting | setting this to true disables graylisting nodes by sequencers in the write path | false | server only |
disable-outlier-based-graylisting | setting this to true disables the outlier based graylisting nodes by sequencers in the write path | true | experimental, server only |
disabled-retry-interval | Time interval during which a sequencer will not route record copies to a storage node that reported a permanent error. | 30s | server only |
enable-adaptive-store-timeout | decides whether to enable an adaptive store timeout | false | experimental, server only |
enable-offset-map | Enables the server-side OffsetMap calculation feature. NOTE: There is no guarantee that byte offset results are correct if the feature was switched on->off->on in a period shorter than the retention value for logs. | false | server only |
enable-sticky-copysets | If set, sequencers will enable sticky copysets. Doesn't affect the copyset index. | true | requires restart, server only |
epoch-metadata-use-new-storage-set-format | Serialize copysets using ShardIDs instead of node_index_t inside EpochMetaData. TODO(T15517759): enable by default once Flexible Log Sharding is fully implemented and this has been thoroughly tested. | false | experimental |
gray-list-threshold | if the number of storage nodes graylisted on the write path of a log exceeds this fraction of the log's nodeset size, the graylist will be cleared to make sure that copysets can still be picked | 0.25 | server only |
graylisting-grace-period | The duration through which a node needs to be consistently an outlier to get graylisted | 300s | server only |
graylisting-min-latency | Don't graylist nodes that have p95 store latency less than this. | 0ms | server only |
graylisting-monitored-period | The duration through which a recently ungraylisted node will be monitored and graylisted as soon as it becomes an outlier | 120s | server only |
graylisting-refresh-interval | The interval at which the graylists are refreshed | 30s | server only |
isolated-sequencer-ttl | How long we wait before disabling isolated sequencers. A sequencer is declared isolated if nodes outside of the innermost failure domain of the sequencer's epoch appear unreachable to the failure detector. For example, a sequencer of a rack-replicated log epoch is declared isolated if the failure detector can't reach any nodes outside of that sequencer node's rack. A disabled sequencer rejects all append requests. | 1200s | server only |
no-redirect-duration | when a sequencer activates upon request from a client, it does not redirect its clients to a different sequencer node for this amount of time (even if for instance the primary sequencer just started up and an older sequencer may be up and running) | 5s | server only |
node-health-check-retry-interval | Time interval during which a node health check probe will not be sent if there is an outstanding request for the same node in the nodeset | 5s | server only |
nodeset-state-refresh-interval | Time interval that rate-limits how often a sequencer can refresh the states of nodes in the nodeset in use | 1s | server only |
nospace-retry-interval | Time interval during which a sequencer will not route record copies to a storage node that reported an out of disk space condition. | 60s | server only |
overloaded-retry-interval | Time interval during which a sequencer will not route record copies to a storage node that reported itself overloaded (storage task queue too long). | 1s | server only |
release-broadcast-interval | the time interval for periodic broadcasts of RELEASE messages by sequencers of regular logs. Such broadcasts are not essential for correct cluster operation. They are used as the last line of defence to make sure storage nodes deliver all records eventually even if a regular (point-to-point) RELEASE message is lost due to a TCP connection failure. See also --release-broadcast-interval-internal-logs. | 300s | server only |
release-broadcast-interval-internal-logs | Same as --release-broadcast-interval but instead applies to internal logs, currently the event logs and logsconfig logs | 5s | server only |
release-retry-interval | RELEASE message retry period | 20s | server only |
shadow-client-creation-retry-interval | If a shadow append fails because the shadow client was not available, a client re-creation request is enqueued; the retry mechanism retries the enqueued attempt after this many seconds. See ShadowClient.cpp for a detailed explanation. 0 disables the retry feature. 1 silently drops all client creations so that they only get created from the retry path. | 60s | client only |
slow-node-retry-interval | After a sequencer's request to store a record copy on a storage node times out, that sequencer will graylist that node for at least this time interval. The sequencer will not pick graylisted nodes for copysets unless --gray-list-threshold is reached or no valid copyset can be selected from nodeset nodes not yet graylisted. For outlier-based graylisting, the duration increases exponentially for each new graylisting up to 10x of this value, and decreases at a linear rate back down to this value while the node is not graylisted (see the sketch after this table). | 600s | server only |
sticky-copysets-block-max-time | The time since starting the last block, after which the copyset manager will consider it expired and start a new one. | 10min | requires restart, server only |
sticky-copysets-block-size | The total size of processed appends (in bytes), after which the sticky copyset manager will start a new block. | 33554432 | requires restart, server only |
store-timeout | timeout for attempts to store a record copy on a specific storage node. This value is used by sequencers only and is NOT the client request timeout. | 10ms..1min | server only |
unroutable-retry-interval | Time interval during which a sequencer will not pick for copysets a storage node whose IP address was reported unroutable by the socket layer | 60s | server only |
use-sequencer-affinity | If true, the routing of append requests to sequencers will first try to find a sequencer in the location given by sequencerAffinity() before looking elsewhere. | false | |
verify-checksum-before-replicating | If set, sequencers and rebuilding will verify checksums of records that have checksums. If there is a mismatch, sequencer will reject the append. Note that this setting doesn't make storage nodes verify checksums. Note that if not set, and --rocksdb-verify-checksum-during-store is set, a corrupted record kills write-availability for that log, as the appender keeps retrying and storage nodes reject the record. | true | server only |
write-shard-id-in-copyset | Serialize copysets using ShardIDs instead of node_index_t on disk. TODO(T15517759): enable by default once Flexible Log Sharding is fully implemented and this has been thoroughly tested. | false | experimental, server only |
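Two of the graylisting rules above lend themselves to a small model: the per-node graylist duration grows exponentially with repeated graylistings (capped at 10x slow-node-retry-interval), and the whole graylist is cleared once it exceeds the gray-list-threshold fraction of the nodeset. The sketch below captures just those two rules (the linear decay while not graylisted is omitted); the function names are hypothetical and the exponential base is an assumption.

```python
SLOW_NODE_RETRY_INTERVAL_S = 600   # slow-node-retry-interval
GRAY_LIST_THRESHOLD = 0.25         # gray-list-threshold

def graylist_duration(times_graylisted: int) -> int:
    """Per-node backoff: doubles with each graylisting, capped at 10x the base."""
    base = SLOW_NODE_RETRY_INTERVAL_S
    return min(base * (2 ** max(times_graylisted - 1, 0)), 10 * base)

def maybe_clear_graylist(graylisted, nodeset_size: int):
    """Clear the graylist if it covers too much of the nodeset to pick copysets."""
    if len(graylisted) > GRAY_LIST_THRESHOLD * nodeset_size:
        return set()
    return graylisted

# First graylisting lasts the base interval; repeat offenders back off up to 10x.
assert graylist_duration(1) == 600
assert graylist_duration(10) == 6000
# With a 20-node nodeset, graylisting a 6th node trips the 25% threshold.
assert maybe_clear_graylist({1, 2, 3, 4, 5, 6}, nodeset_size=20) == set()
```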