RowNumber and TopNRowNumber Fuzzer
The RowNumberFuzzer and TopNRowNumberFuzzer are testing tools that automatically generate equivalent query plans that use the RowNumber and TopNRowNumber Velox plan nodes, and then execute these plans to validate the consistency of the results. They work as follows:
Data Generation: Generate a random set of input data, also known as a vector. This data can have a variety of encodings and data layouts to ensure thorough testing.
Plan Generation: Generate equivalent query plans and validate results across all of them.
For RowNumberFuzzer:
Base plan: RowNumber over ValuesNode.
Alternative plan: RowNumber over TableScanNode.
For TopNRowNumberFuzzer:
Base plan: TopNRowNumber over ValuesNode using a randomly chosen rank function (row_number, rank, or dense_rank) with a random row limit.
Alternative plan: WindowNode over ValuesNode using the same rank function and partition/sort keys, followed by a filter on the rank value to apply the same limit. This validates that the optimised TopNRowNumber operator produces results consistent with the general Window operator.
Alternative plan: TopNRowNumber over TableScanNode using the same rank function and limit.
Query Execution: Execute the equivalent query plans using the generated data and assert that the results are consistent across the different plans.
Execute the base plan, compare the result with the reference (DuckDB or Presto) and use it as the expected result.
Execute the alternative plans multiple times with and without spill, and compare each result with the expected result.
Iteration: This process is repeated multiple times to ensure reliability and robustness.
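The consistency check behind the TopNRowNumberFuzzer plans can be illustrated outside Velox. The following is a minimal Python sketch, not Velox code: the helper names (`ranked`, `window_then_filter`, `top_n_per_partition`) and the sample rows are hypothetical and exist only for illustration. One function ranks every row in a partition and then filters on the rank value, the other stops scanning a partition as soon as the rank exceeds the limit. Results are compared as order-insensitive multisets of (partition, sort key, rank), because row_number may order tied rows arbitrarily.

```python
from collections import Counter, defaultdict

RANK_FUNCTIONS = ("row_number", "rank", "dense_rank")


def ranked(keys, function):
    """Yield (key, rank value) for the keys in ascending order."""
    prev, rank, dense = object(), 0, 0
    for position, key in enumerate(sorted(keys), start=1):
        if key != prev:
            # A new sort key value starts a new peer group.
            rank, dense, prev = position, dense + 1, key
        values = {"row_number": position, "rank": rank, "dense_rank": dense}
        yield key, values[function]


def partitions(rows):
    """Group (partition, sort key) rows by partition."""
    groups = defaultdict(list)
    for partition, key in rows:
        groups[partition].append(key)
    return groups


def window_then_filter(rows, function, limit):
    """General plan: rank every row, then filter on the rank value."""
    out = Counter()
    for partition, keys in partitions(rows).items():
        for key, value in ranked(keys, function):
            if value <= limit:
                out[(partition, key, value)] += 1
    return out


def top_n_per_partition(rows, function, limit):
    """Specialised plan: stop once the rank value exceeds the limit."""
    out = Counter()
    for partition, keys in partitions(rows).items():
        for key, value in ranked(keys, function):
            if value > limit:
                break  # rank values never decrease within a partition
            out[(partition, key, value)] += 1
    return out


# Hypothetical input: (partition key, sort key) pairs with ties.
rows = [("a", 3), ("a", 1), ("a", 1), ("a", 2), ("b", 5), ("b", 5), ("b", 7)]

for function in RANK_FUNCTIONS:
    expected = top_n_per_partition(rows, function, limit=2)
    actual = window_then_filter(rows, function, limit=2)
    assert actual == expected, f"mismatch for {function}"
```

Comparing multisets rather than ordered lists mirrors the idea that equivalent plans must return the same rows, while the order of rows in the output is not part of the contract.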
How to run
Use velox_row_number_fuzzer to run RowNumberFuzzer:
velox/exec/fuzzer/velox_row_number_fuzzer --seed 123 --duration_sec 60
Similarly, use velox_topn_row_number_fuzzer to run TopNRowNumberFuzzer:
velox/exec/fuzzer/velox_topn_row_number_fuzzer --seed 123 --duration_sec 60
By default, the fuzzer runs 10 iterations. Use the --steps or --duration_sec flag to run the fuzzer for longer. Use --seed to reproduce fuzzer failures.
Here is a full list of supported command line arguments.
--steps: How many iterations to run. Each iteration generates and evaluates one expression or aggregation. Default is 10.
--duration_sec: For how long to run in seconds. If both --steps and --duration_sec are specified, --duration_sec takes precedence.
--seed: The seed to generate random expressions and input vectors with.
--v=1: Verbose logging (from Google Logging Library).
--batch_size: The size of input vectors to generate. Default is 100.
--num_batches: The number of input vectors of size --batch_size to generate. Default is 5.
--enable_spill: Whether to test with spilling or not. Default is true.
--presto_url: The PrestoQueryRunner url along with its port number.
--req_timeout_ms: Timeout in milliseconds of an HTTP request to the PrestoQueryRunner.
--arbitrator_capacity: Arbitrator capacity in bytes. Default is 6L << 30.
--allocator_capacity: Allocator capacity in bytes. Default is 8L << 30.
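The interaction between --steps, --duration_sec and --seed can be pictured as a loop condition plus a seeded random generator. The sketch below is illustrative Python under stated assumptions, not the fuzzer's actual code: the names `should_continue` and `run` are hypothetical, and treating a duration of 0 as "not specified" is an assumption made for this example.

```python
import random
import time


def should_continue(iteration, started_at, steps=10, duration_sec=0,
                    now=time.monotonic):
    """Return True while the run should keep going.

    Assumption: duration_sec == 0 means "not specified". When a duration
    is given it takes precedence and steps is ignored.
    """
    if duration_sec > 0:
        return now() - started_at < duration_sec
    return iteration < steps


def run(seed, steps=10, duration_sec=0):
    """Record the rank function picked in each iteration."""
    rng = random.Random(seed)  # same seed, same sequence of choices
    started_at = time.monotonic()
    choices = []
    iteration = 0
    while should_continue(iteration, started_at, steps, duration_sec):
        choices.append(rng.choice(["row_number", "rank", "dense_rank"]))
        iteration += 1
    return choices


# Re-running with the same seed replays the same choices.
assert run(seed=123) == run(seed=123)
```

Because every random choice is drawn from the seeded generator, a failing run can be replayed by passing the same seed again, which is the purpose of the --seed flag.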
If running from CLion IDE, add --logtostderr=1 to see the full output.