T-Digest Functions¶
T-digest and quantile digest are two older algorithms for estimating rank-based metrics. T-digest generally has better performance than quantile digest and better accuracy at the tails (often dramatically better), but may have worse accuracy at the median depending on the compression factor used. In comparison, quantile digest supports more numeric types and provides a maximum rank error guarantee, which ensures relative uniformity of precision along the quantiles. Quantile digests are also formally proven to support lossless merges, while T-digest is not (though it does empirically demonstrate lossless merges).
T-digest was developed by Ted Dunning and is more restrictive in its type support,
accepting only double type parameters. This contrasts with quantile digest, which
supports a broader range of numeric types including bigint, double, and real,
making quantile digest more versatile for different data types.
Velox uses the modern KLL sketch algorithm for the approx_percentile function, which
provides stronger accuracy guarantees than both T-digest and quantile digest.
The T-digest functions documented here exist primarily to support
pre-existing workloads that have data stored using the T-digest format for backward compatibility.
Data Structures¶
A T-digest is a data sketch which stores approximate percentile information.
The Velox type for this data structure is called tdigest,
and it accepts a parameter of type double which represents the set of
numbers to be ingested by the tdigest.
T-digests may be merged without losing precision, and for storage and retrieval
they may be cast to/from VARBINARY.
Functions¶
- construct_tdigest(means: array<double>, counts: array<integer>, compression: double, min: double, max: double, sum: double, count: bigint) tdigest<double>¶
Constructs a T-digest from the given parameters:
means- array of centroid meanscounts- array of centroid counts (weights)compression- compression factormin- minimum valuemax- maximum valuesum- sum of all valuescount- total count of values
- destructure_tdigest(digest: tdigest<double>) -> row(means array<double>, counts array<integer>, compression double, min double, max double, sum double, count bigint)¶
Destructures a T-digest into its component parts, returning a row containing:
means- array of centroid meanscounts- array of centroid countscompression- compression factormin- minimum valuemax- maximum valuesum- sum of all valuescount- total count of values
- merge(tdigest<double>) tdigest<double>¶
Merges all input
tdigests into a singletdigest.
- merge_tdigest(digests: array<tdigest<double>>) tdigest<double>¶
Merges an array of T-digests into a single T-digest.
- quantile_at_value(digest: tdigest<double>, value: double) double¶
Returns the approximate quantile (percentile) of the given
valuebased on the T-digestdigest. The result will be between zero and one (inclusive).
- quantiles_at_values(digest: tdigest<double>, values: array<double>) array<double>¶
Returns the approximate quantiles (percentiles) as an array for each of the given
valuesbased on the T-digestdigest. All results will be between zero and one (inclusive).
- scale_tdigest(digest: tdigest<double>, scale: double) tdigest<double>¶
Scales the T-digest
digestby the givenscalefactor. This multiplies all the centroid values in the T-digest by the scale factor.
- tdigest_agg(x: double) tdigest<double>¶
Returns the
tdigestwhich summarizes the approximate distribution of all input values ofx. The default compression factor is100.
- tdigest_agg(x: double, w: double) tdigest<double>
Returns the
tdigestwhich summarizes the approximate distribution of all input values ofxusing per-item weightw. The default compression factor is100.
- tdigest_agg(x: double, w: double, compression: double) tdigest<double>
Returns the
tdigestwhich summarizes the approximate distribution of all input values ofxusing per-item weightwand the specified compression factor.compressionmust be a positive constant for all input rows. The default is100, maximum is1000, and values lower than10are rounded to10. Higher compression means more accuracy at the cost of more memory.
- trimmed_mean(digest: tdigest<double>, low_quantile: double, high_quantile: double) double¶
Returns the mean of values between
low_quantileandhigh_quantile(inclusive) from the T-digestdigest. Both quantile values must be between zero and one (inclusive), andlow_quantilemust be less than or equal tohigh_quantile.
- winsorized_mean(digest: tdigest<double>, low_quantile: double, high_quantile: double) double¶
Returns the Winsorized mean from the T-digest
digest. Values belowlow_quantileare replaced with the boundary value at that quantile. Values abovehigh_quantileare replaced with the boundary value at that quantile. The mean is then computed on all values including the replaced tails. Both quantile values must be between zero and one (inclusive), andlow_quantilemust be less than or equal tohigh_quantile.Unlike
trimmed_mean()which excludes tail values entirely,winsorized_meanreplaces them with the boundary values, keeping the total count unchanged.
- approx_winsorized_mean(x: double, low_quantile: double, high_quantile: double) double¶
Returns the approximate Winsorized mean of all input values of
xusing a T-digest sketch. This is a single-pass aggregate that replaces values belowlow_quantilewith the boundary value at that quantile, and values abovehigh_quantilewith the boundary value at that quantile, then computes the mean.low_quantileandhigh_quantilemust be constants between zero and one (inclusive), andlow_quantilemust be less than or equal tohigh_quantile. The default compression factor is100.This function replaces the common two-pass pattern of computing percentile thresholds with
approx_percentileand then capping values withLEAST/GREATESTbefore computingAVG.Example:
-- Old two-pass pattern: WITH t AS ( SELECT APPROX_PERCENTILE(x, 0.99) AS thresh FROM data ) SELECT AVG(LEAST(data.x, t.thresh)) FROM data, t; -- New single-pass equivalent: SELECT approx_winsorized_mean(x, 0.0, 0.99) FROM data;
- approx_winsorized_mean(x: double, low_quantile: double, high_quantile: double, compression: double) double
Like the above, but with a specified compression factor.
compressionmust be a positive constant. The default is100, maximum is1000, and values lower than10are rounded to10. Higher compression means more accuracy at the cost of more memory.
- value_at_quantile(digest: tdigest<double>, quantile: double) double¶
Returns the approximate percentile value from the T-digest
digestat the givenquantile. Thequantilemust be between zero and one (inclusive).
- values_at_quantiles(digest: tdigest<double>, quantiles: array<double>) array<double>¶
Returns the approximate percentile values as an array from the T-digest
digestat each of the specified quantiles given in thequantilesarray. All quantile values must be between zero and one (inclusive).