I/O Optimizations and the TpchBenchmark¶

Introduction¶

This document is the outcome from a cycle of benchmarking to determine the best I/O performance against AWS S3 Parquet data for TPCH. It is intended to describe the tuning process recommendations and general suggestions for tuning the Velox query engine.

Benchmarking in Velox is made easy with the optionally built TpchBenchmark (velox_tpch_benchmark) executable. To build the benchmark executable (_build/release/velox/benchamrks/tpch/velox_tpch_benchmark), use the following command line to do the build with S3 support:

$ make benchmarks-build EXTRA_CMAKE_FLAGS="-DVELOX_ENABLE_S3=ON"

Use Cases for Velox¶

Velox is a library so there are multiple ways to use it. This document will describe two models used by popular applications.

In-Process Multi-Threaded Executor Use Case¶

This use case is used by Presto and, as it turns out, is the use case used by the Velox TpchBenchmark Tool below. This use case uses a single multi-threaded process to perform execution of queries in parallel. Each query is broken up into threads called drivers via a planning algorithm. Each driver may also have a thread pool to perform I/O in a parallel manner. In the benchmarking tool both driver thread count, and I/O thread count are exposed as command line configuration options. In this use case, care must be taken to not create too many threads since maximum number of threads is a product of driver threads and I/O threads. In this use case, the application owns creating drivers and I/O Thread pools for Velox. In the case of the benchmark tool, the tool is responsible for both driver threads and I/O threads.

Multiple Process Executor Use Case¶

This use case is used by Spark + Gluten and it differs from the Presto use case where parallelism is concerned. Spark uses multiple processes where each process is a Gluten+Velox query processor. Spark scales by using many Linux processes for query processing. In this case this means that the drivers are outside of Velox and Gluten and is defined by the Spark configuration and number of workers. Gluten takes on the role of creating and exposing the I/O thread pool count to Spark as configuration and then injecting the I/O thread pool into Velox for parallel I/O.

Built-In TpchBenchmark¶

This tool uses the “in-process” multi-threaded executor use case. This tool exposes quite a few benchmark options as command line options. It also exposes many Velox internal options. This document will only cover a subset of the options and possible best-known values for them.

The setup used in experiments leading to this document was an AWS instance ri6-8xlarge (32 vCPUs; 256GB RAM). The values (or formulas) below are based on these experiments.

TpchBenchmark Tool Optimizations in Velox¶

The tool exposes the following options (Note: When passing options for the tool use a single dash not a double dash; i.e. -option and not –option):

num_drivers - As described above this represent the number of drivers or executors used to process the TPC-H queries.
num_io_threads - This represents the number of I/O threads used per driver for I/O.
cache_gb - How much memory is used for caching data locally in this process. Memory caching cannot be shared in the multiple process executor use case but works well in the single process, multi-threaded use case.
num_splits_per_file - This is a row group optimization for the stored dataset for benchmarking.

NOTE: There is a limitation on the implementation of the AWS SDK that will cause failures (curl error 28) if the **driver* threads times I/O threads grow much beyond 350 threads. This only really effects the multi-threaded **drivers* use case like the benchmark tool. It is only known to be an issue when running against AWS S3. However, the error is coming from the libcurl library so it is possible other Cloud storage APIs could also be affected.

Velox exposes other options used for tuning that are of interest:

max_coalesce_bytes - Size of coalesced data, has small improvements as size grows.
max_coalesce_distance_bytes - Maximum gap bytes between data that can coalesced. Larger may mean more fetched data but at greater bytes/sec.

Top Optimization Recommendations¶

Starting Point for Tuning¶
Option Name	Single Process	Multi-Process
num_drivers	max(20, vCPUs<super>*</super> X 3 / 4)	NA
num_io_threads	max(16, vCPUs<super>*</super> X 3 / 8)	vCPUs*
cache_gb	50% System RAM	NA (default = 0)
num_splits_per_file	Row Group Size of Data	Row Group Size of Data
max_coalesced_bytes	Minimum of 90MB	Minimum of 90MB
max_coalesced_distance_bytes	Workload dependent**	Workload dependent**
parquet_prefetch_rowgroups	Best is 1	Best is 1
split_preload_per_driver	Best is 2	Best is 2
cache_prefetch_min_pct	Best is 80	Best is 80

* vCPUs = (cores * hyper-threads)

** Wide tables and few columns retrieved per row can lead to many I/O requests, suggest increasing this value based on testing and the need to reduce small I/O requests.

Optimizations for the TpchBenchmark (Single Process Use Case)¶

num_drivers¶

This configuration option describes the number of resources available to process query plan tasks. This can greatly improve performance of CPU bound workloads. The recommendation in the table above seems to be optimal for TPCH-like workloads.

num_io_threads¶

This configuration option describes the number of threads available per driver to retrieve data from the network source. If the workload is I/O bound, then increasing this number is beneficial if the data is fewer requests of larger chunks as opposed to many smaller requests.

cache_gb¶

This configuration option is useful for workloads that read the same data several times per query but only applies to the single process use case. NOTE: There is a SSD Caching option in Velox but it to is ONLY useful in the single process use case.

num_splits_per_file¶

This configuration option is best when the data set count of row groups matches this value. The affect in overall performance appears based on testing to be small, however.

Optimizations for All Workloads (Both Use Cases)¶

max_coalesce_bytes¶

This configuration option is the maximum bytes coalesced into a single request to the data source. This was tested from the default 128MB to 2GB, and the overall improvement was small as size increased. Capturing request data did show larger and fewer requests but not enough to vastly improve I/O performance.

max_coalesce_distance_bytes¶

This configuration option is the maximum byte distance between needed data in the same file at the data source that can be coalesced. Increasing this value would theoretically reduce the number of requests and increase each request size. However, if made too large the query will return too many un-needed bytes and could decrease I/O performance. This plus __max_coalesce_bytes__ should be fine-tuned for the workload being run.

Summary¶

If a use of Velox matches the use case of the TcphBenchmark then it is a good tool to test, I/O and driver performance for specific TCP-H queries. This would benefit execution of specific production workloads that are like the chosen queries. If in multi-process use case, like Spark/Gluten/Velox configuration, the recommendation is to oversubscribe I/O threads between 2X and 3X vCPUs and tune the 2 coalesce configurations exposed.

Appendix A: TpchBenchmark Tool Help Output¶

From the repository root, use the following command line to see all the available flags in the TpchBenchmark tool.

$ ./_build/release/velox/benchmarks/tpch/velox_tpch_benchmark -helpon=TpchBenchmark