KHyperLogLog Functions
KHyperLogLog is a data sketch for estimating reidentifiability and joinability within a dataset. Following the KHyperLogLog paper, it maintains a map of K HyperLogLog (HLL) structures: each entry corresponds to a unique key from one column, and its HLL estimates the cardinality of the unique identifiers from another column that are associated with that key.
Data Structures
A KHyperLogLog is a data sketch which stores approximate cardinality information for key-value
associations. The Velox type for this data structure is called KHyperLogLog.
For storage and retrieval, KHyperLogLog values may be cast to/from VARBINARY.
Serialization format is compatible with Presto’s.
Aggregate Functions
- khyperloglog_agg(x, uii) -> KHyperLogLog
Returns the KHyperLogLog sketch that summarizes the association between the key column x and the unique identifier column uii. The x parameter supplies the key values, and uii supplies the unique identifiers associated with each key.
- merge(KHyperLogLog) -> KHyperLogLog
Returns the KHyperLogLog that is the aggregate union of the individual KHyperLogLog structures.
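To make the semantics of the two aggregates concrete, the following is a minimal Python sketch that models a KHyperLogLog as a plain dictionary from key to an exact set of identifiers. It is an illustration only: a real sketch bounds the number of keys it retains and uses HyperLogLog estimators rather than exact sets, and the helper names khll_agg and khll_merge are hypothetical, not Velox APIs.

```python
from collections import defaultdict


def khll_agg(rows):
    """Model of khyperloglog_agg: group unique identifiers by key.

    `rows` is an iterable of (x, uii) pairs. An exact set stands in for
    the per-key HyperLogLog (an assumption made for illustration).
    """
    sketch = defaultdict(set)
    for x, uii in rows:
        sketch[x].add(uii)
    return dict(sketch)


def khll_merge(sketches):
    """Model of merge: union the per-key identifier sets of all inputs."""
    merged = defaultdict(set)
    for sketch in sketches:
        for key, ids in sketch.items():
            merged[key] |= ids
    return dict(merged)


# Two partial sketches, e.g. built from different partitions of a table.
a = khll_agg([("94107", "u1"), ("94107", "u2"), ("10001", "u3")])
b = khll_agg([("94107", "u2"), ("60601", "u4")])

merged = khll_merge([a, b])
print(sorted(merged))        # keys present in either input
print(len(merged["94107"]))  # distinct identifiers seen for key 94107
```

Because the merge is a per-key union, merging partial sketches gives the same result as aggregating all rows at once, which is what makes the structure usable in distributed aggregation.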
Scalar Functions
- cardinality(khll) -> bigint
Returns the estimated total cardinality (number of unique keys) of the KHyperLogLog sketch khll.
- intersection_cardinality(khll1, khll2) -> bigint
Returns the estimated intersection cardinality of two KHyperLogLog sketches. If both sketches are exact (small cardinality), returns the exact intersection count. Otherwise, returns an approximation based on the Jaccard index.
- jaccard_index(khll1, khll2) -> double
Returns the Jaccard index (similarity coefficient) between two KHyperLogLog sketches. The Jaccard index is a value in [0, 1], where:
1.0 means the sets are identical
0.0 means the sets are disjoint (no overlap)
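As a rough illustration of how the Jaccard index relates to the intersection estimate described for intersection_cardinality, here is a small Python sketch over exact key sets. It assumes the textbook definition J = |A ∩ B| / |A ∪ B| and recovers an intersection size as J × |A ∪ B|. Whether Velox computes its approximation in exactly this way is an assumption, and both function names below are hypothetical helpers rather than the engine's implementation.

```python
def jaccard(keys_a, keys_b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two key sets."""
    union = keys_a | keys_b
    if not union:
        return 0.0  # convention chosen here for two empty sets (assumption)
    return len(keys_a & keys_b) / len(union)


def intersection_estimate(keys_a, keys_b):
    """Recover an intersection size from the Jaccard index and the union size."""
    union_size = len(keys_a | keys_b)
    return round(jaccard(keys_a, keys_b) * union_size)


keys_a = {"k1", "k2", "k3", "k4"}
keys_b = {"k3", "k4", "k5", "k6"}

print(jaccard(keys_a, keys_b))                # 2 shared keys out of 6 distinct
print(intersection_estimate(keys_a, keys_b))  # shared-key count recovered from J
```

With exact sets the round trip is lossless. With real sketches both the Jaccard index and the union cardinality are estimates, so the result is approximate.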
- merge_khll(array(KHyperLogLog)) -> KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
Returns NULL if the input array is NULL, empty, or contains only NULL elements.
Ignores NULL elements and merges only the valid KHyperLogLog structures when the array contains a mix of NULL and non-null elements.
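The NULL-handling rules for merge_khll can be summarized with a short Python model in which None stands in for NULL and each sketch is a dictionary from key to an exact identifier set. Both of these are simplifying assumptions, and the function below is a hypothetical stand-in, not the Velox implementation.

```python
def merge_khll_model(sketches):
    """Model of merge_khll's NULL handling, with None standing in for NULL."""
    if sketches is None:
        return None  # NULL input array
    valid = [s for s in sketches if s is not None]
    if not valid:
        return None  # empty array, or only NULL elements
    merged = {}
    for sketch in valid:
        for key, ids in sketch.items():
            # Build fresh sets so the input sketches are not mutated.
            merged.setdefault(key, set()).update(ids)
    return merged


mixed = [{"a": {1}}, None, {"a": {2}, "b": {3}}]
print(merge_khll_model(mixed))         # NULL element is skipped
print(merge_khll_model([None, None]))  # only NULL elements
```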
- reidentification_potential(khll, threshold) -> double
Returns the reidentification potential of the KHyperLogLog sketch khll at the given threshold. This is the fraction of keys whose cardinality is at or below the threshold, which indicates how easily those keys could be reidentified.
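The definition above can be sketched in a few lines of Python. The sketch is modeled as a dictionary from key to an exact set of identifiers, which is an assumption made for illustration (real sketches hold per-key HyperLogLog estimates), and reidentification_potential_model is a hypothetical helper name.

```python
def reidentification_potential_model(sketch, threshold):
    """Fraction of keys whose identifier count is at or below `threshold`."""
    if not sketch:
        return 0.0  # convention for an empty sketch (assumption)
    at_risk = sum(1 for ids in sketch.values() if len(ids) <= threshold)
    return at_risk / len(sketch)


# Hypothetical data: ZIP code -> users observed with that ZIP code.
by_zip = {
    "94107": {"u1"},
    "10001": {"u2", "u3"},
    "60601": {"u4", "u5", "u6", "u7"},
    "73301": {"u8"},
}

print(reidentification_potential_model(by_zip, 1))  # keys tied to a single user
print(reidentification_potential_model(by_zip, 2))  # keys tied to at most two users
```

A high value at a low threshold means many key values map to only a handful of identifiers, so knowing the key narrows a record down to very few individuals.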
- uniqueness_distribution(khll)
Returns a histogram map representing the distribution of uniqueness values in the KHyperLogLog sketch khll. Each key in the map represents a cardinality bucket, and the value represents the fraction of keys falling into that bucket. The histogram size defaults to the minhash size of the KHyperLogLog instance.
- uniqueness_distribution(khll, histogramSize)
Returns a histogram map representing the distribution of uniqueness values in the KHyperLogLog sketch khll with the specified histogramSize. Each key in the map represents a cardinality bucket, and the value represents the fraction of keys falling into that bucket.
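The histogram can be pictured with the following Python sketch, again using exact identifier sets in place of per-key HyperLogLog estimates. Several details here are assumptions rather than documented behavior: buckets are numbered 1 through histogram_size, cardinalities larger than histogram_size are clamped into the last bucket, and empty buckets are reported with a fraction of 0.0. The function name is a hypothetical helper.

```python
from collections import Counter


def uniqueness_distribution_model(sketch, histogram_size):
    """Map each cardinality bucket 1..histogram_size to its share of keys."""
    # Clamp large cardinalities into the last bucket (assumption).
    counts = Counter(min(len(ids), histogram_size) for ids in sketch.values())
    total = len(sketch)
    return {
        bucket: (counts[bucket] / total if total else 0.0)
        for bucket in range(1, histogram_size + 1)
    }


# Hypothetical data: key -> identifiers observed with that key.
example = {
    "94107": {"u1"},
    "10001": {"u2", "u3"},
    "60601": {"u4", "u5", "u6", "u7"},
    "73301": {"u8"},
}

histogram = uniqueness_distribution_model(example, 3)
print(histogram)  # fraction of keys per cardinality bucket
```

Under these assumptions the fractions sum to 1, and the mass in the lowest buckets corresponds to the keys that reidentification_potential counts at small thresholds.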