November 2023 Update

Core Library

  • Add spilling support for aggregations over distinct or sorted inputs. #7305, #7526

  • Add support to lazily create the spill directory. #7660

  • Add config merge_exchange.max_buffer_size to limit the total memory used by exchange clients. #7410

  • Add configs sort_writer_max_output_rows and sort_writer_max_output_bytes to limit memory usage of sort writer. #7339

  • Add termination time to TaskStats. This is the time when the downstream workers finish consuming results. Clients such as Prestissimo can use this metric to clean up tasks. #7479

  • Add Presto Type Parser based on Flex and Bison. This can be used by a Presto verifier to parse types in the response from the Presto server #7568

  • Add support for named row fields in type signature and binding. Example: row(foo bigint) in a signature only binds to inputs whose type is row with a single BIGINT field named foo. #7523

  • Add support to shrink cache if clients such as Prestissimo detect high memory usage on a worker. #7547, #7645

  • Fix distinct aggregations with global grouping sets on empty input. Instead of empty results, for global grouping sets, the expected result is a row per global grouping set with the groupId as the key value. #7353

  • Fix incorrect runtime stats reporting when memory arbitration is triggered. #7394

  • Fix Timestamp::toMillis() to overflow only if the final result overflows. #7506

Presto Functions

  • Add cosine_similarity() scalar function.

  • Add support for INTERVAL DAY TO SECOND type input to plus(), minus(), multiply() functions.

  • Add support for combination of TIMESTAMP, INTERVAL DAY TO SECOND type inputs to plus(), minus() functions.

  • Add support for INTERVAL DAY TO SECOND, DOUBLE input arguments to divide() function.

  • Add support to allow non-constant IN list in IN Presto predicate. #7497

  • Register array_frequency() function for all primitive types.

  • Fix bitwise shift functions to accept shift value 0.

  • Fix url_extract_* functions to return null on malformed inputs and support absolute URIs.

  • Fix from_utf8() handling of invalid UTF-8 codepoint. #7442

  • Fix entropy() aggregate function to return 0.0 on null inputs.

  • Fix array_sort() function from producing invalid dictionary vectors. #7800

  • Fix lead(), lag() window functions to return null when the offset is null. #7254

  • Fix DECIMAL to VARCHAR cast by adding trailing zeros when the value is 0. #7588

Spark Functions

Hive Connector

  • Add DirectBufferedInput: a selective BufferedInput without caching. #7217

  • Add support for reading UNSIGNED INTEGER types in Parquet format. #6728

  • Add spill support for DWRF sort writer. #7326

  • Add file_handle_cache_enabled Hive Config to enable or disable caching file handles.

  • Add documentation for num_cached_file_handles configuration property.

  • Add support for DECIMAL and VARCHAR types in BenchmarkParquetReader. #6275

Arrow

  • Add support to export constant vector as Arrow REE array. #7327, #7398

  • Add support for TIMESTAMP type in Arrow bridge. #7435

  • Fix Arrow bridge to ensure the null_count is always set and add support for null constants. #7411

Performance and Correctness

  • Add PrestoQueryRunner that can be used to verify test results against Presto. #7628

  • Add support for plans with TableScan in Join Fuzzer. #7571

  • Add support for custom input generators in Aggregation Fuzzer. #7594

  • Add support for aggregations over sorted inputs in AggregationFuzzer #7620

  • Add support for custom result verifiers in AggregationFuzzer. #7674

  • Add custom verifiers for approx_percentile() and approx_distinct() in AggregationFuzzer. #7654

  • Optimize map subscript by caching input keys in a hash map. #7191

  • Optimize FlatVector<StringView>::copy() slow path using a DecodedVector and pre-allocated the string buffer. #7357

  • Optimize element_at for maps with complex type keys by sorting the keys and using binary search. #7365

  • Optimize concat() by adding a fast path for primitive values. #7393

  • Optimize json_parse() function exception handling by switching to simdjson. #7658

  • Optimize add_items for VARCHAR type by avoiding a deep copy. #7395

  • Optimize remaining filter by lazily evaluating multi-referenced fields. #7433

  • Optimize TopN::addInput() by deferring copying of the non-key columns. #7172

  • Optimize by sorting the inputs once when multiple aggregations share sorting keys and orders. #7452

  • Optimize Exchange operator by allowing merging of small batches of data into larger vectors. #7404

Build

  • Add DuckDB version 0.8.1 as an external dependency and remove DuckDB amalgamation. #6725

  • Add libcpr a lightweight http client. #7385

  • Upgrade Arrow dependency to 14.0.1 from 13.0.0.

Credits

Alex Hornby, Amit Dutta, Andrii Rosa, Austin Dickey Bikramjeet Vig, Cheng Huang, Chengcheng Jin, Christopher Ponce de Leon, Daniel Munoz, Deepak Majeti, Ge Gao, Genevieve (Genna) Helsel, Harvey Hunt, Jake Jung, Jia, Jia Ke, Jialiang Tan, Jimmy Lu, John Elliott, Karteekmurthys, Ke, Kevin Wilfong, Krishna Pai, Laith Sakka, Masha Basmanova, Orri Erling, PHILO-HE, Patrick Sullivan, Pedro Eugenio Rocha Pedreira, Pramod, Richard Barnes, Schierbeck, Cody, Sergey Pershin, Wei He, Zhenyuan Zhao, aditi-pandit, curt, duanmeng, joey.ljy, lingbin, rui-mo, usurai, vibhatha, wypb, xiaoxmeng, xumingming, yangchuan, yaqi-zhao, yingsu00, yiweiHeOSS, youxiduo, zhli, 高阳阳