SetDigest Functions

SetDigest is a data sketch for estimating set cardinality and performing set operations like intersection cardinality and Jaccard index. It combines HyperLogLog for cardinality estimation with MinHash for exact counting and intersection operations.

SetDigests may be merged, and for storage and retrieval they may be cast to/from VARBINARY.

Data Structures

A SetDigest is a data sketch which stores approximate set membership and cardinality information. The Velox type for this data structure is called SetDigest. SetDigests support two element types internally:

  • bigint - for integer values (all numeric types are converted to bigint)

  • varchar - for string values

When a SetDigest is exact (cardinality is less than the maximum hash limit), operations like intersection cardinality return exact results. When the digest becomes approximate (high cardinality), it uses HyperLogLog and MinHash estimation.

Serialization format is compatible with Presto’s.

Aggregate Functions

make_set_digest(x) SetDigest

Returns the SetDigest sketch which summarizes the input data set of x. Supported input types include: boolean, tinyint, smallint, integer, bigint, real, double, date, varchar, and varbinary.

merge_set_digest(SetDigest) SetDigest

Returns the SetDigest of the aggregate union of the individual SetDigest structures.

Scalar Functions

cardinality(setdigest) bigint

Returns the estimated cardinality of the set represented by the SetDigest sketch. If the digest is exact (low cardinality), returns the exact count. Otherwise, returns an approximation using HyperLogLog.

intersection_cardinality(setdigest1, setdigest2) bigint

Returns the estimated intersection cardinality between two SetDigest sketches.

  • If both digests are exact: returns the exact intersection count

  • If either digest is approximate: returns an estimation using the Jaccard index

The result is capped at the minimum cardinality of the two input digests to ensure logical consistency.

jaccard_index(setdigest1, setdigest2) double

Returns the Jaccard index (similarity coefficient) between two SetDigest sketches. The Jaccard index is a value in [0, 1] where:

  • 1.0 means the sets are identical

  • 0.0 means the sets are disjoint (no overlap)

Uses MinHash estimation for efficient computation.