T-Digest Functions¶
T-digest and quantile digest are two older algorithms for estimating rank-based metrics. T-digest generally has better performance than quantile digest and better accuracy at the tails (often dramatically better), but may have worse accuracy at the median depending on the compression factor used. In comparison, quantile digest supports more numeric types and provides a maximum rank error guarantee, which ensures relative uniformity of precision along the quantiles. Quantile digests are also formally proven to support lossless merges, while T-digest is not (though it does empirically demonstrate lossless merges).
T-digest was developed by Ted Dunning and is more restrictive in its type support,
accepting only double
type parameters. This contrasts with quantile digest, which
supports a broader range of numeric types including bigint
, double
, and real
,
making quantile digest more versatile for different data types.
Velox uses the modern KLL sketch algorithm for the approx_percentile
function, which
provides stronger accuracy guarantees than both T-digest and quantile digest.
The T-digest functions documented here exist primarily to support
pre-existing workloads that have data stored using the T-digest format for backward compatibility.
Data Structures¶
A T-digest is a data sketch which stores approximate percentile information.
The Velox type for this data structure is called tdigest
,
and it accepts a parameter of type double
which represents the set of
numbers to be ingested by the tdigest
.
T-digests may be merged without losing precision, and for storage and retrieval
they may be cast to/from VARBINARY
.
Functions¶
- construct_tdigest(means: array<double>, counts: array<integer>, compression: double, min: double, max: double, sum: double, count: bigint) tdigest<double> ¶
Constructs a T-digest from the given parameters:
means
- array of centroid meanscounts
- array of centroid counts (weights)compression
- compression factormin
- minimum valuemax
- maximum valuesum
- sum of all valuescount
- total count of values
- destructure_tdigest(digest: tdigest<double>) -> row(means array<double>, counts array<integer>, compression double, min double, max double, sum double, count bigint)¶
Destructures a T-digest into its component parts, returning a row containing:
means
- array of centroid meanscounts
- array of centroid countscompression
- compression factormin
- minimum valuemax
- maximum valuesum
- sum of all valuescount
- total count of values
- merge(tdigest<double>) tdigest<double> ¶
Merges all input
tdigest
s into a singletdigest
.
- merge_tdigest(digests: array<tdigest<double>>) tdigest<double> ¶
Merges an array of T-digests into a single T-digest.
- quantile_at_value(digest: tdigest<double>, value: double) double ¶
Returns the approximate quantile (percentile) of the given
value
based on the T-digestdigest
. The result will be between zero and one (inclusive).
- quantiles_at_values(digest: tdigest<double>, values: array<double>) array<double> ¶
Returns the approximate quantiles (percentiles) as an array for each of the given
values
based on the T-digestdigest
. All results will be between zero and one (inclusive).
- scale_tdigest(digest: tdigest<double>, scale: double) tdigest<double> ¶
Scales the T-digest
digest
by the givenscale
factor. This multiplies all the centroid values in the T-digest by the scale factor.
- tdigest_agg(x: double) tdigest<double> ¶
Returns the
tdigest
which summarizes the approximate distribution of all input values ofx
. The default compression factor is100
.
- tdigest_agg(x: double, w: double) tdigest<double>
Returns the
tdigest
which summarizes the approximate distribution of all input values ofx
using per-item weightw
. The default compression factor is100
.
- tdigest_agg(x: double, w: double, compression: double) tdigest<double>
Returns the
tdigest
which summarizes the approximate distribution of all input values ofx
using per-item weightw
and the specified compression factor.compression
must be a positive constant for all input rows. The default is100
, maximum is1000
, and values lower than10
are rounded to10
. Higher compression means more accuracy at the cost of more memory.
- trimmed_mean(digest: tdigest<double>, low_quantile: double, high_quantile: double) double ¶
Returns the mean of values between
low_quantile
andhigh_quantile
(inclusive) from the T-digestdigest
. Both quantile values must be between zero and one (inclusive), andlow_quantile
must be less than or equal tohigh_quantile
.
- value_at_quantile(digest: tdigest<double>, quantile: double) double ¶
Returns the approximate percentile value from the T-digest
digest
at the givenquantile
. Thequantile
must be between zero and one (inclusive).
- values_at_quantiles(digest: tdigest<double>, quantiles: array<double>) array<double> ¶
Returns the approximate percentile values as an array from the T-digest
digest
at each of the specified quantiles given in thequantiles
array. All quantile values must be between zero and one (inclusive).