There are two kinds: - quantile sketches, such as t-digest, which aim to control...

ssfrr · on July 26, 2024

Do these both assume the quantile is stationary, or are they also applicable in tracking a rolling quantile (aka quantile filtering)? Below I gave an algorithm I’ve used for quantile filtering, but that’s a somewhat different problem than streaming single-pass estimation of a stationary quantile.

ted_dunning · on July 26, 2024

Most quantile sketches (and t-digest in particular) do not assume stationarity.

Note also that there are other bounds of importance and each has trade-offs.

T-digest gives you a strict bound on memory use and no dynamic allocation. But it does not have guaranteed accuracy bounds. It gives very good accuracy in practice and is very good at relative errors (i.e. the 99.999th percentile estimate is between the 99.9985%-ile and the 99.9995%-ile).

KLL sketch gives you a strict bound on memory use, but is limited to absolute quantile error (i.e. the 99.99%-ile estimate is between the 99.9%-ile and the 100%-ile). This is useless for extrema, but fine for medians.

Cormode's extension to the KLL sketch gives you a strict bound on relative accuracy, but n log n memory use.

Exponential histograms give you strict bounds on memory use, no allocation and strict bounds on relative error in measurement space (i.e. 99.99%-ile ± % error). See the log-histogram[1] for some simple code and hdrHistogram[2] for a widely used version. Variants of this are used in Prometheus.

The exponential histogram is, by far, the best choice in most practical situations since an answer that says 3.1 ± 0.2 seconds is just so much more understandable for humans than a bound on quantile error. I say this as the author of the t-digest.

[1] https://github.com/tdunning/t-digest/blob/main/core/src/main...

[2] https://hdrhistogram.org/
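To make the exponential histogram idea above concrete, here is a minimal Python sketch. It is not the code from log-histogram or HdrHistogram; the class name `LogHistogram`, its parameters (`lo`, `hi`, `rel_err`), and the choice of bucket ratio are assumptions made for illustration. Bucket edges grow geometrically, so the bucket array is allocated once and any in-range value is reported to within roughly `rel_err` of its true size.

```python
import math


class LogHistogram:
    """Illustrative fixed-size histogram with geometrically spaced buckets."""

    def __init__(self, lo=1e-3, hi=1e3, rel_err=0.01):
        # Assumption: values of interest lie in [lo, hi]; anything outside is clamped.
        self.lo = lo
        self.rel_err = rel_err
        # Adjacent bucket edges differ by a factor of (1 + rel_err)^2.
        self.log_gamma = 2.0 * math.log1p(rel_err)
        self.n_buckets = int(math.ceil(math.log(hi / lo) / self.log_gamma)) + 1
        self.counts = [0] * self.n_buckets  # allocated once, never resized
        self.total = 0

    def _index(self, x):
        if x <= self.lo:
            return 0
        i = int(math.log(x / self.lo) / self.log_gamma)
        return min(i, self.n_buckets - 1)

    def add(self, x):
        self.counts[self._index(x)] += 1
        self.total += 1

    def quantile(self, q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        if self.total == 0:
            raise ValueError("empty histogram")
        target = q * self.total
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if c > 0 and seen >= target:
                # Report the geometric midpoint of bucket i, which is within
                # about rel_err (relative) of every value in that bucket.
                lower = self.lo * math.exp(i * self.log_gamma)
                return lower * (1.0 + self.rel_err)
        raise AssertionError("unreachable for q in [0, 1]")
```

With the defaults this uses a few hundred integer counters regardless of how many values are added, and the answer to "what is the 99th percentile?" comes back as a measurement with a known relative error rather than as a bound on rank.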

throwaway_2494 · on July 25, 2024

See also the Greenwald-Khanna quantile estimator, an online algorithm that can compute any quantile to within a given ϵ.

https://aakinshin.net/posts/greenwald-khanna-quantile-estima...

sevensor · on July 26, 2024

I am so glad I asked. This is a wheel I’ve been reinventing in my head for eighteen years now. I’ve even asked in other venues, why are there no online median algorithms? Nobody knew of even one. Turns out, I was asking the wrong people!

throwaway_2494 · on July 26, 2024

Glad it helped.

I've used online quantile estimators (GK in particular) to very effectively look for anomalies in streaming data.

It worked much better than the usual mean/stddev threshold approach (which was embedded in competing products), because it made no assumptions about the distribution of the data.

One thing to note is that GK is online, but not windowed, so it looks back to the very first value.

However, this can be overcome by using multiple, possibly overlapping summaries, so that old values eventually drop off.
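One way the overlapping-summaries idea might look in Python is sketched below. Everything here is an assumption for illustration: `StaggeredQuantile` and `ExactSummary` are hypothetical names, and `ExactSummary` is a plain sorted list standing in for a real GK summary so that the example stays short and checkable. A fresh summary is started every `window // n_summaries` items, the oldest is discarded once it would cover more than `window` items, and queries go to the oldest summary still alive.

```python
import bisect
from collections import deque


class ExactSummary:
    """Stand-in for a streaming summary such as GK (this one is exact, not a sketch)."""

    def __init__(self):
        self.values = []

    def add(self, x):
        bisect.insort(self.values, x)

    def quantile(self, q):
        if not self.values:
            raise ValueError("empty summary")
        idx = min(len(self.values) - 1, int(q * len(self.values)))
        return self.values[idx]


class StaggeredQuantile:
    """Approximate sliding window built from overlapping, staggered summaries."""

    def __init__(self, window, n_summaries=4, make_summary=ExactSummary):
        if window % n_summaries != 0:
            raise ValueError("window must be a multiple of n_summaries")
        self.n_summaries = n_summaries
        self.stride = window // n_summaries
        self.make_summary = make_summary
        self.summaries = deque()  # oldest summary first
        self.seen = 0

    def add(self, x):
        # Every `stride` items, start a new summary and retire the oldest
        # one if that would leave more than `n_summaries` alive.
        if self.seen % self.stride == 0:
            self.summaries.append(self.make_summary())
            if len(self.summaries) > self.n_summaries:
                self.summaries.popleft()
        for s in self.summaries:
            s.add(x)
        self.seen += 1

    def quantile(self, q):
        if not self.summaries:
            raise ValueError("no data yet")
        # The oldest live summary covers the most history without exceeding the window.
        return self.summaries[0].quantile(q)
```

In steady state the answer reflects between `window - stride + 1` and `window` of the most recent items, so more summaries give a tighter window at the cost of updating each of them on every insert.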

sevensor · on July 26, 2024

Awesome, you did one! I’ll give it a read.