Configuration properties¶
Generic Configuration¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
preferred_output_batch_bytes |
integer |
10MB |
Preferred size of batches in bytes to be returned by operators from Operator::getOutput. It is used when an estimate of average row size is known. Otherwise preferred_output_batch_rows is used. |
preferred_output_batch_rows |
integer |
1024 |
Preferred number of rows to be returned by operators from Operator::getOutput. It is used when an estimate of average row size is not known. When the estimate of average row size is known, preferred_output_batch_bytes is used. |
max_output_batch_rows |
integer |
10000 |
Max number of rows that could be return by operators from Operator::getOutput. It is used when an estimate of average row size is known and preferred_output_batch_bytes is used to compute the number of output rows. |
abandon_partial_aggregation_min_rows |
integer |
10000 |
Min number of rows when we check if a partial aggregation is not reducing the cardinality well and might be a subject to being abandoned. |
abandon_partial_aggregation_min_pct |
integer |
80 |
If a partial aggregation’s number of output rows constitues this or highler percentage of the number of input rows, then this partial aggregation will be a subject to being abandoned. |
session_timezone |
string |
User provided session timezone. Stores a string with the actual timezone name, e.g: “America/Los_Angeles”. |
|
adjust_timestamp_to_session_timezone |
bool |
false |
If true, timezone-less timestamp conversions (e.g. string to timestamp, when the string does not specify a timezone) will be adjusted to the user provided session_timezone (if any). For instance: if this option is true and user supplied “America/Los_Angeles”, then “1970-01-01” will be converted to -28800 instead of 0. |
track_operator_cpu_usage |
bool |
true |
Whether to track CPU usage for stages of individual operators. Can be expensive when processing small batches, e.g. < 10K rows. |
hash_adaptivity_enabled |
bool |
true |
If false, the ‘group by’ code is forced to use generic hash mode hashtable. |
adaptive_filter_reordering_enabled |
bool |
true |
If true, the conjunction expression can reorder inputs based on the time taken to calculate them. |
max_local_exchange_buffer_size |
integer |
32MB |
Used for backpressure to block local exchange producers when the local exchange buffer reaches or exceeds this size. |
max_page_partitioning_buffer_size |
integer |
32MB |
The target size for a Task’s buffered output. The producer Drivers are blocked when the buffered size exceeds this. The Drivers are resumed when the buffered size goes below PartitionedOutputBufferManager::kContinuePct (90)% of this. |
Expression Evaluation Configuration¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
expression.eval_simplified |
boolean |
false |
Whether to use the simplified expression evaluation path. |
expression.track_cpu_usage |
boolean |
false |
Whether to track CPU usage for individual expressions (supported by call and cast expressions). Can be expensive when processing small batches, e.g. < 10K rows. |
cast_match_struct_by_name |
bool |
false |
This flag makes the Row conversion to by applied in a way that the casting row field are matched by name instead of position. |
cast_to_int_by_truncate |
bool |
false |
This flags forces the cast from float/double to integer to be performed by truncating the decimal part instead of rounding. |
Memory Management¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
max_partial_aggregation_memory |
integer |
16MB |
Maximum amount of memory in bytes for partial aggregation results. Increasing this value can result in less network transfer and lower CPU utilization by allowing more groups to be kept locally before being flushed, at the cost of additional memory usage. |
max_extended_partial_aggregation_memory |
integer |
16MB |
Maximum amount of memory in bytes for partial aggregation results if cardinality reduction is below partial_aggregation_reduction_ratio_threshold. Every time partial aggregate results size reaches max_partial_aggregation_memory bytes, the results are flushed. If cardinality reduction is below partial_aggregation_reduction_ratio_threshold, i.e. number of result rows / number of input rows > partial_aggregation_reduction_ratio_threshold, memory limit for partial aggregation is automatically doubled up to max_extended_partial_aggregation_memory. This adaptation is disabled by default, since the value of max_extended_partial_aggregation_memory equals the value of max_partial_aggregation_memory. Specify higher value for max_extended_partial_aggregation_memory to enable. |
Spilling¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
spill_enabled |
boolean |
false |
Spill memory to disk to avoid exceeding memory limits for the query. |
aggregation_spill_enabled |
boolean |
false |
When spill_enabled is true, determines whether to spill memory to disk for aggregations to avoid exceeding memory limits for the query. |
join_spill_enabled |
boolean |
false |
When spill_enabled is true, determines whether to spill memory to disk for hash joins to avoid exceeding memory limits for the query. |
order_by_spill_enabled |
boolean |
false |
When spill_enabled is true, determines whether to spill memory to disk for order by to avoid exceeding memory limits for the query. |
aggregation_spill_memory_threshold |
integer |
0 |
Maximum amount of memory in bytes that a final aggregation can use before spilling. 0 means unlimited. |
join_spill_memory_threshold |
integer |
0 |
Maximum amount of memory in bytes that a hash join build side can use before spilling. 0 means unlimited. |
order_by_spill_memory_threshold |
integer |
0 |
Maximum amount of memory in bytes that an order by can use before spilling. 0 means unlimited. |
spillable_reservation_growth_pct |
integer |
25 |
The spillable memory reservation growth percentage of the current memory reservation size. Suppose a growth percentage of N and the current memory reservation size of M, the next memory reservation size will be M * (1 + N / 100). After growing the memory reservation K times, the memory reservation size will be M * (1 + N / 100) ^ K. Hence the memory reservation grows along a series of powers of (1 + N / 100). If the memory reservation fails, it starts spilling. |
max_spill_level |
integer |
4 |
The maximum allowed spilling level with zero being the initial spilling level. Applies to hash join build spilling which might use recursive spilling when the build table is very large. -1 means unlimited. In this case an extremely large query might run out of spilling partition bits. The max spill level can be used to prevent a query from using too much io and cpu resources. |
max_spill_file_size |
integer |
0 |
The maximum allowed spill file size. Zero means unlimited. |
min_spill_run_size |
integer |
256MB |
The minimum spill run size (bytes) limit used to select partitions for spilling. The spiller tries to spill a previously spilled partitions if its data size exceeds this limit, otherwise it spills the partition with most data. If the limit is zero, then the spiller always spills a previously spilled partition if it has any data. This is to avoid spill from a partition with a small amount of data which might result in generating too many small spilled files. |
spiller_start_partition_bit |
integer |
29 |
The start partition bit which is used with spiller_partition_bits together to calculate the spilling partition number. |
spiller_partition_bits |
integer |
2 |
The number of bits used to calculate the spilling partition number. The number of spilling partitions will be power of two. At the moment the maximum value is 3, meaning we only support up to 8-way spill partitioning. |
testing.spill_pct |
integer |
0 |
Percentage of aggregation or join input batches that will be forced to spill for testing. 0 means no extra spilling. |
Codegen Configuration¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
codegen.enabled |
boolean |
false |
Along with codegen.configuration_file_path enables codegen in task execution path. |
codegen.configuration_file_path |
string |
A path to the file contaning codegen options. |
|
codegen.lazy_loading |
boolean |
true |
Triggers codegen initialization tests upon loading if false. Otherwise skips them. |
Hive Connector¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
max_partitions_per_writers |
integer |
100 |
Maximum number of partitions per a single table writer instance. |
insert_existing_partitions_behavior |
string |
ERROR |
Allowed values: |
immutable_partitions |
bool |
false |
True if appending data to an existing unpartitioned table is allowed. Currently this configuration does not support appending to existing partitions. |
Amazon S3 Configuration
¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
hive.s3.use-instance-credentials |
bool |
true |
Use the EC2 metadata service to retrieve API credentials. This works with IAM roles in EC2. |
hive.s3.aws-access-key |
string |
Default AWS access key to use. |
|
hive.s3.aws-secret-key |
string |
Default AWS secret key to use. |
|
hive.s3.endpoint |
string |
The S3 storage endpoint server. This can be used to connect to an S3-compatible storage system instead of AWS. |
|
hive.s3.path-style-access |
bool |
false |
Use path-style access for all requests to the S3-compatible storage. This is for S3-compatible storage that doesn’t support virtual-hosted-style access. |
hive.s3.ssl.enabled |
bool |
true |
Use HTTPS to communicate with the S3 API. |
hive.s3.log-level |
string |
FATAL |
Allowed values: “OFF”, “FATAL”, “ERROR”, “WARN”, “INFO”, “DEBUG”, “TRACE” Granularity of logging generated by the AWS C++ SDK library. |
hive.s3.iam-role |
string |
IAM role to assume. |
|
hive.s3.iam-role-session-name |
string |
velox-session |
Session name associated with the IAM role. |
Spark-specific Configuration¶
Property Name |
Type |
Default Value |
Description |
---|---|---|---|
spark.legacy_size_of_null |
bool |
true |
If false, |