KHyperLogLog Functions

KHyperLogLog is a data sketch for estimating reidentifiability and joinability within a dataset. Based on the KHyperLogLog paper, it maintains a map of K HyperLogLog (HLL) structures. Each entry corresponds to a unique key from one column, and its HLL estimates the number of distinct unique identifiers from another column that are associated with that key.

Data Structures

A KHyperLogLog is a data sketch which stores approximate cardinality information for key-value associations. The Velox type for this data structure is called KHyperLogLog. For storage and retrieval, KHyperLogLog values may be cast to/from VARBINARY.

The serialization format is compatible with Presto’s.

Aggregate Functions

khyperloglog_agg(x, uii) KHyperLogLog

Returns the KHyperLogLog sketch which summarizes the association between the key column x and the unique identifier column uii. The x parameter represents the key values and uii represents the unique identifiers associated with each key.

merge(KHyperLogLog) KHyperLogLog

Returns the KHyperLogLog sketch that represents the union of all input KHyperLogLog sketches.

Scalar Functions

cardinality(khll) bigint

Returns the estimated total cardinality (number of unique keys) from the KHyperLogLog sketch khll.

intersection_cardinality(khll1, khll2) bigint

Returns the estimated intersection cardinality of two KHyperLogLog sketches. If both sketches are exact, which is the case when their cardinality is small, returns the exact intersection count. Otherwise, returns an approximation derived from the Jaccard index.
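The two cases can be illustrated with a small Python sketch. It assumes the approximation is the common identity |A ∩ B| ≈ J(A, B) × |A ∪ B|; this page does not state the exact formula, so treat that as an assumption. Both function names are hypothetical and operate on plain Python values rather than real sketches.

```python
def exact_intersection_cardinality(keys_a, keys_b):
    """Exact case: count the keys present in both key sets."""
    return len(set(keys_a) & set(keys_b))


def approx_intersection_cardinality(jaccard, union_cardinality):
    """Approximate case (assumed formula): Jaccard index times union size."""
    return round(jaccard * union_cardinality)
```

For example, a Jaccard index of 0.5 over a union of 4 keys yields an estimated intersection of 2 keys under this assumption.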

jaccard_index(khll1, khll2) double

Returns the Jaccard index (similarity coefficient) between two KHyperLogLog sketches. The Jaccard index is a value in [0, 1] where:

  • 1.0 means the sets are identical

  • 0.0 means the sets are disjoint (no overlap)
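As a reference point, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The following Python sketch computes it on exact key sets, which stand in for the sketches here; a real KHyperLogLog estimates this value from its retained keys. The function name and the handling of two empty inputs are assumptions made for illustration.

```python
def jaccard_index(keys_a, keys_b):
    """Exact Jaccard index of two key collections."""
    a, b = set(keys_a), set(keys_b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having no overlap.
        return 0.0
    return len(a & b) / len(union)
```

With keys {1, 2, 3} and {2, 3, 4}, two of the four distinct keys are shared, so the index is 0.5.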

merge_khll(array(KHyperLogLog)) KHyperLogLog

Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.

  • Returns NULL if the input array is NULL, empty, or contains only NULL elements

  • When the array contains a mix of NULL and non-NULL elements, ignores the NULL elements and merges only the valid KHyperLogLog structures
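The NULL-handling rules above can be sketched in Python, with None standing in for NULL and a dict from key to a set of identifiers standing in for a KHyperLogLog. This is an exact toy model under those assumptions, not the actual merge algorithm: it ignores the limit of K keys that a real sketch enforces.

```python
def merge_khll(sketches):
    """Union a list of toy sketches (dict: key -> set of identifiers)."""
    if sketches is None:
        return None
    valid = [s for s in sketches if s is not None]
    if not valid:
        # Covers both an empty list and a list holding only None elements.
        return None
    merged = {}
    for sketch in valid:
        for key, ids in sketch.items():
            # Union the identifiers seen for the same key across sketches.
            merged.setdefault(key, set()).update(ids)
    return merged
```

Merging per key, rather than concatenating, matters because the same key may appear in several input sketches with overlapping identifiers.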

reidentification_potential(khll, threshold) double

Returns the reidentification potential of the KHyperLogLog sketch khll at the given threshold. This is the fraction of keys whose estimated number of associated unique identifiers is at or below the threshold. Keys associated with few unique identifiers are easier to reidentify, so a higher value indicates greater reidentification risk.
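A short Python sketch of this ratio, using an exact dict from key to a set of identifiers in place of a real sketch (so the per-key counts are exact rather than HLL estimates). The function name mirrors the SQL function for readability only, and returning 0.0 for an empty input is an assumption.

```python
def reidentification_potential(key_to_ids, threshold):
    """Fraction of keys with at most `threshold` associated identifiers."""
    if not key_to_ids:
        return 0.0  # assumption: an empty input has no reidentifiable keys
    at_or_below = sum(1 for ids in key_to_ids.values() if len(ids) <= threshold)
    return at_or_below / len(key_to_ids)


data = {"a": {1}, "b": {1, 2}, "c": {1, 2, 3}, "d": {1, 2, 3, 4}}
potential = reidentification_potential(data, threshold=2)
```

Here keys "a" and "b" have at most 2 identifiers each, so 2 of the 4 keys fall at or below the threshold.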

uniqueness_distribution(khll)

Returns a histogram map representing the distribution of uniqueness values in the KHyperLogLog sketch khll, where the uniqueness of a key is the estimated number of unique identifiers associated with it. Each map key is a cardinality bucket, and the corresponding map value is the fraction of sketch keys falling into that bucket. The histogram size defaults to the minhash size of the KHyperLogLog instance.

uniqueness_distribution(khll, histogramSize)

Returns a histogram map representing the distribution of uniqueness values in the KHyperLogLog sketch khll with the specified histogramSize, where the uniqueness of a key is the estimated number of unique identifiers associated with it. Each map key is a cardinality bucket, and the corresponding map value is the fraction of sketch keys falling into that bucket.
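The histogram can be sketched in Python on an exact dict from key to a set of identifiers. Two details are assumptions not stated on this page: buckets are numbered 1 through histogramSize, and keys with more identifiers than histogramSize are counted in the last bucket. The function name mirrors the SQL function for readability only.

```python
def uniqueness_distribution(key_to_ids, histogram_size):
    """Map each cardinality bucket to the fraction of keys that fall in it."""
    # Assumption: buckets run from 1 to histogram_size and start at zero.
    histogram = {bucket: 0.0 for bucket in range(1, histogram_size + 1)}
    total = len(key_to_ids)
    for ids in key_to_ids.values():
        # Assumption: counts above histogram_size are clamped to the last bucket.
        # Every key is expected to have at least one identifier.
        bucket = min(len(ids), histogram_size)
        histogram[bucket] += 1 / total
    return histogram


data = {"a": {1}, "b": {2}, "c": {1, 2}, "d": {1, 2, 3, 4, 5}}
distribution = uniqueness_distribution(data, histogram_size=3)
```

In this example, key "d" has 5 identifiers but is counted in bucket 3 because of the clamping assumption.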