The following instance properties set default values for table properties.
| Property Name | Description | Default Value | Run CDK Deploy When Changed |
|---|---|---|---|
| sleeper.default.table.query.parquet.column.index.enabled | Used during Sleeper queries to determine whether the column/offset indexes (also known as page indexes) are read from Parquet files. For some queries, e.g. single/few row lookups this can improve performance by enabling more aggressive pruning. On range queries, especially on large tables this can harm performance, since readers will read the extra index data before returning results, but with little benefit from pruning. | false | false |
| sleeper.default.table.parquet.rowgroup.size | Maximum number of bytes to write in a Parquet row group (default is 8MiB). This property is NOT used by the DataFusion data engine. | 8388608 | false |
| sleeper.default.table.parquet.page.size | The size of the pages in the Parquet files (default is 128KiB). | 131072 | false |
| sleeper.default.table.parquet.compression.codec | The compression codec to use in the Parquet files. Valid values are: [uncompressed, snappy, gzip, lzo, brotli, lz4, zstd] | zstd | false |
| sleeper.default.table.parquet.dictionary.encoding.rowkey.fields | Whether dictionary encoding should be used for row key columns in the Parquet files. | false | false |
| sleeper.default.table.parquet.dictionary.encoding.sortkey.fields | Whether dictionary encoding should be used for sort key columns in the Parquet files. | false | false |
| sleeper.default.table.parquet.dictionary.encoding.value.fields | Whether dictionary encoding should be used for value columns in the Parquet files. | false | false |
| sleeper.default.table.parquet.columnindex.truncate.length | Used to set parquet.columnindex.truncate.length; see the documentation here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md This is the length in bytes to truncate binary values in a column index. | 128 | false |
| sleeper.default.table.parquet.statistics.truncate.length | Used to set parquet.statistics.truncate.length; see the documentation here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md This is the length in bytes to truncate the min/max binary values in row groups. | 2147483647 | false |
| sleeper.default.table.datafusion.s3.readahead.enabled | Enables a cache of data when reading from S3 with the DataFusion data engine, to hold data in larger blocks than are requested by DataFusion. | true | false |
| sleeper.default.table.parquet.writer.version | Used to set parquet.writer.version; see the documentation here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md Can be either v1 or v2. The v2 pages store levels uncompressed, while v1 pages compress levels with the data. | v2 | false |
| sleeper.default.table.parquet.rowgroup.rows.max | Maximum number of rows to write in a Parquet row group. | 100000 | false |
| sleeper.default.table.statestore.transactionlog.add.transaction.max.attempts | The number of attempts to make when applying a transaction to the state store. This default can be overridden by a table property. | 10 | false |
| sleeper.default.table.statestore.transactionlog.add.transaction.first.retry.wait.ceiling.ms | The maximum amount of time to wait before the first retry when applying a transaction to the state store. Full jitter will be applied, so that the actual wait time will be a random period between 0 and this value. This ceiling will increase exponentially on further retries. For details, see: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ This default can be overridden by a table property. | 200 | false |
| sleeper.default.table.statestore.transactionlog.add.transaction.max.retry.wait.ceiling.ms | The maximum amount of time to wait before any retry when applying a transaction to the state store. Full jitter will be applied, so that the actual wait time will be a random period between 0 and this value. This restricts the exponential increase of the wait ceiling while retrying the transaction. For details, see: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ This default can be overridden by a table property. | 30000 | false |
| sleeper.default.table.statestore.transactionlog.files.snapshot.batch.size | The number of elements to include per Arrow row batch in a snapshot derived from the transaction log, of the state of files in a Sleeper table. Each file includes some number of references on different partitions. Each reference will count for one element in a row batch, but a file cannot currently be split between row batches. A row batch may contain more file references than this if a single file overflows the batch. A file with no references counts as one element. | 1000 | false |
| sleeper.default.table.statestore.transactionlog.partitions.snapshot.batch.size | The number of partitions to include per Arrow row batch in a snapshot derived from the transaction log, of the state of partitions in a Sleeper table. | 1000 | false |
| sleeper.default.table.statestore.transactionlog.time.between.snapshot.checks.seconds | The number of seconds to wait after we've loaded a snapshot before looking for a new snapshot. This should relate to the rate at which new snapshots are created, configured in the instance property sleeper.statestore.transactionlog.snapshot.creation.lambda.period.seconds. This default can be overridden by a table property. | 60 | false |
| sleeper.default.table.statestore.transactionlog.time.between.transaction.checks.ms | The number of milliseconds to wait after we've updated from the transaction log before checking for new transactions. The state visible to an instance of the state store can be out of date by this amount. This can avoid excessive queries by the same process, but can result in unwanted behaviour when using multiple state store objects. When adding a new transaction to update the state, this will be ignored and the state will be brought completely up to date. This default can be overridden by a table property. | 0 | false |
| sleeper.default.table.statestore.transactionlog.snapshot.load.min.transactions.ahead | The minimum number of transactions that a snapshot must be ahead of the local state, before we load the snapshot instead of updating from the transaction log. | 10 | false |
| sleeper.default.table.statestore.transactionlog.snapshot.expiry.days | The number of days that transaction log snapshots remain in the snapshot store before being deleted. | 2 | false |
| sleeper.default.table.statestore.transactionlog.delete.behind.snapshot.min.age.minutes | The minimum age in minutes of a snapshot in order to allow deletion of transactions leading up to it. When deleting old transactions, there's a chance that processes may still read transactions starting from an older snapshot. We need to avoid deletion of any transactions associated with a snapshot that may still be used as the starting point for reading the log. | 2 | false |
| sleeper.default.table.statestore.transactionlog.delete.number.behind.latest.snapshot | The minimum number of transactions that a transaction must be behind the latest snapshot before being deleted. This is the number of transactions that will be kept and protected from deletion whenever old transactions are deleted. This includes the transaction that the latest snapshot was created against. Any transactions after the snapshot will never be deleted, as they are still in active use. This should be configured in relation to the property which determines whether a process will load the latest snapshot or instead seek through the transaction log, since we need to preserve transactions that may still be read: sleeper.default.table.statestore.transactionlog.snapshot.load.min.transactions.ahead. The snapshot that will be considered the latest snapshot is determined by the property that sets the minimum age for it to count: sleeper.default.table.statestore.transactionlog.delete.behind.snapshot.min.age.minutes. | 200 | false |
| sleeper.default.table.bulk.import.min.leaf.partitions | Specifies the minimum number of leaf partitions that are needed to run a bulk import job. If this minimum has not been reached, bulk import jobs will refuse to start. | 256 | false |
| sleeper.default.table.bulk.import.partition.splitting.attempts | Specifies the number of times bulk import tries to create leaf partitions to meet the minimum number of leaf partitions. This will be retried if another process splits the same partitions at the same time. | 3 | false |
| sleeper.default.table.ingest.batcher.job.min.size | Specifies the minimum total file size required for an ingest job to be batched and sent. An ingest job will be created if the batcher runs while this much data is waiting, and the minimum number of files is also met. | 1G | false |
| sleeper.default.table.ingest.batcher.job.max.size | Specifies the maximum total file size for a job in the ingest batcher. If more data is waiting than this, it will be split into multiple jobs. If a single file exceeds this, it will still be ingested in its own job. It's also possible some data may be left for a future run of the batcher if some recent files overflow the size of a job but aren't enough to create a job on their own. | 5G | false |
| sleeper.default.table.ingest.batcher.job.min.files | Specifies the minimum number of files for a job in the ingest batcher. An ingest job will be created if the batcher runs while this many files are waiting, and the minimum size of files is also met. | 1 | false |
| sleeper.default.table.ingest.batcher.job.max.files | Specifies the maximum number of files for a job in the ingest batcher. If more files are waiting than this, they will be split into multiple jobs. It's possible some data may be left for a future run of the batcher if some recent files overflow the size of a job but aren't enough to create a job on their own. | 100 | false |
| sleeper.default.table.ingest.batcher.file.max.age.seconds | Specifies the maximum time in seconds that a file can be held in the batcher before it will be included in an ingest job. When any file has been waiting for longer than this, jobs will be created for all the currently held files, even if other criteria for a batch are not met. | 300 | false |
| sleeper.default.table.ingest.batcher.ingest.queue | Specifies the target ingest queue where batched jobs are sent. Valid values are: [standard_ingest, bulk_import_emr, bulk_import_persistent_emr, bulk_import_eks, bulk_import_emr_serverless] | bulk_import_emr_serverless | false |
| sleeper.default.table.ingest.batcher.file.tracking.ttl.minutes | The time in minutes that the tracking information is retained for a file before the records of its ingest are deleted (e.g. which ingest job it was assigned to, the time this occurred, the size of the file). The expiry time is fixed when a file is saved to the store, so changing this will only affect new data. Defaults to 1 week. | 10080 | false |
| sleeper.default.table.ingest.file.writing.strategy | Specifies the strategy that ingest uses to create files and references in partitions. Valid values are: [one_file_per_leaf, one_reference_per_leaf] | one_reference_per_leaf | false |
| sleeper.default.table.ingest.row.batch.type | The way in which rows are held in memory before they are written to a local store. Valid values are 'arraylist' and 'arrow'. The arraylist method is simpler, but it is slower and requires careful tuning of the number of rows in each batch. | arrow | false |
| sleeper.default.table.ingest.partition.file.writer.type | The way in which partition files are written to the main Sleeper store. Valid values are 'direct' (which writes using the s3a Hadoop file system) and 'async' (which writes locally and then copies the completed Parquet file asynchronously into S3). The direct method is simpler, but the async method should provide better performance when the number of partitions is large. | async | false |
| sleeper.default.table.statestore.commit.async.behaviour | This is the default for whether state store updates will be applied asynchronously via the state store committer. This is usually only used for state store implementations where there's a benefit to applying state store updates in a single process for each Sleeper table, usually to avoid contention from multiple processes performing updates at the same time. This is separate from the properties that determine which state store updates will be done as asynchronous commits. Those properties will only be applied when asynchronous commits are enabled for a given state store. Valid values are: [disabled, per_implementation, all_implementations] With disabled, asynchronous commits will never be used unless overridden in table properties. With per_implementation, asynchronous commits will be used for all state store implementations that are known to benefit from it, unless overridden in table properties. With all_implementations, asynchronous commits will be used for all state stores unless overridden in table properties. | PER_IMPLEMENTATION | false |
| sleeper.default.table.compaction.job.id.assignment.commit.async | This is the default for whether created compaction jobs will be assigned to their input files asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, the compaction job creator will commit input file assignments directly to the state store. | true | false |
| sleeper.default.table.compaction.job.commit.async | This is the default for whether compaction tasks will commit finished jobs asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, compaction tasks will commit finished jobs directly to the state store. | true | false |
| sleeper.default.table.compaction.job.async.commit.batching | This property is the default for whether commits of compaction jobs are batched before being sent to the state store commit queue to be applied by the committer lambda. If this property is true and asynchronous commits are enabled then commits of compactions will be batched. If this property is false and asynchronous commits are enabled then commits of compactions will not be batched and will be sent directly to the committer lambda. This property can be overridden for individual tables. | true | false |
| sleeper.default.table.ingest.job.files.commit.async | This is the default for whether ingest tasks will add files asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, ingest tasks will add files directly to the state store. | true | false |
| sleeper.default.table.bulk.import.job.files.commit.async | This is the default for whether bulk import will add files asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, bulk import will add files directly to the state store. | true | false |
| sleeper.default.table.partition.splitting.commit.async | This is the default for whether partition splits will be applied asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, the partition splitter will apply splits directly to the state store. | true | false |
| sleeper.default.table.gc.commit.async | This is the default for whether the garbage collector will record deleted files asynchronously via the state store committer, if asynchronous commit is enabled. Otherwise, the garbage collector will record this directly to the state store. | true | false |
| sleeper.default.table.statestore.committer.update.every.commit | When using the transaction log state store, this sets whether to update from the transaction log before adding a transaction in the asynchronous state store committer. If asynchronous commits are used for all or almost all state store updates, this can be false to avoid the extra queries. If the state store is commonly updated directly outside of the asynchronous committer, this can be true to avoid conflicts and retries. | false | false |
| sleeper.default.table.statestore.committer.update.every.batch | When using the transaction log state store, this sets whether to update from the transaction log before adding a batch of transactions in the asynchronous state store committer. | true | false |
| sleeper.default.table.data.engine | Select which data engine to use for the table. Valid values are: [java, datafusion, datafusion_experimental] | DATAFUSION | false |
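As an illustration, these defaults could be overridden in an instance properties file. This is a sketch only; the property names and values below are taken from the table above, but the exact file layout depends on how your instance is configured and deployed.

```properties
# Sketch of an instance properties file overriding some table defaults.
# Property names and values are from the table above.

# Parquet settings applied to tables that do not override them
sleeper.default.table.parquet.compression.codec=zstd
sleeper.default.table.parquet.rowgroup.size=8388608
sleeper.default.table.parquet.page.size=131072

# Retry behaviour when applying transactions to the state store
sleeper.default.table.statestore.transactionlog.add.transaction.max.attempts=10
sleeper.default.table.statestore.transactionlog.add.transaction.first.retry.wait.ceiling.ms=200
sleeper.default.table.statestore.transactionlog.add.transaction.max.retry.wait.ceiling.ms=30000
```

Any of these can still be overridden per table where the description notes that a table property takes precedence.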
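The two retry-wait properties above follow the full-jitter scheme described in the linked AWS article: the wait ceiling starts at the first-retry value, grows exponentially, and is capped at the maximum, with the actual wait drawn uniformly between 0 and the ceiling. The sketch below illustrates that scheme; the doubling growth factor and function names are assumptions for illustration, not Sleeper's actual implementation.

```python
import random

def wait_ceiling_ms(attempt, first_ceiling_ms=200, max_ceiling_ms=30000):
    """Wait ceiling for a retry attempt (0-based), growing exponentially
    from the first-retry ceiling and capped at the maximum ceiling.
    A doubling growth factor is assumed here for illustration."""
    return min(first_ceiling_ms * (2 ** attempt), max_ceiling_ms)

def full_jitter_wait_ms(attempt, first_ceiling_ms=200, max_ceiling_ms=30000):
    """Actual wait with full jitter: uniform between 0 and the ceiling."""
    ceiling = wait_ceiling_ms(attempt, first_ceiling_ms, max_ceiling_ms)
    return random.uniform(0, ceiling)
```

With the defaults from the table, the ceiling for the first retry is 200ms, and later retries are never allowed to wait more than 30000ms regardless of how many attempts have been made.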