
Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].

You keep an estimate of the current quantile value, and then for each element in your stream you either increment the estimate (if the element is greater than your estimate) or decrement it (if the element is less than your estimate) by fixed "up-step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of the increment and decrement steps, you can estimate any quantile.

For example, say that your increment step is 0.95 and your decrement step is 0.05. When your estimate converges to a steady state, the expected movement must be zero, so you must be hitting greater values 5% of the time and lesser values 95% of the time; hence you've estimated the 95th percentile.

The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, the estimate converges slowly and won't track changes in the stream. But you don't want your steps to be too large either, because the step size limits your precision. The FAME algorithm handles this with a step where you shrink your step size whenever a data value lands near your estimate, causing the step size to auto-scale down.
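To make the update rule concrete, here's a minimal sketch of the fixed-step version (not the full FAME algorithm; the function name, starting estimate of 0, and `step` parameter are my own choices). For quantile q, the up-step is q·step and the down-step is (1-q)·step, so the expected movement is zero exactly when a fraction q of the data lies below the estimate:

```python
def stream_quantile(stream, q, step=1.0):
    """Estimate the q-quantile of a stream with a fixed overall step size.

    Move the estimate up by q*step when a value exceeds it, down by
    (1-q)*step when a value falls below it. At steady state the expected
    movement is zero: (1-q)*P(greater) == q*P(lesser) holds when the
    estimate sits at the q-quantile.
    """
    est = 0.0  # starting guess; any value in the data's range works
    for x in stream:
        if x > est:
            est += q * step
        elif x < est:
            est -= (1 - q) * step
    return est
```

With q = 0.5 the two steps are equal and you get the median; with q = 0.95 the up-step is 19x the down-step, so the estimate settles where only 5% of values exceed it. FAME additionally shrinks `step` when a value lands close to the estimate, which this sketch omits.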

[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...

[2]: https://stats.stackexchange.com/a/445714



The state of the art has moved well beyond these algorithms. See these:

https://github.com/tdunning/t-digest

https://www.sciencedirect.com/science/article/pii/S266596382...

https://arxiv.org/pdf/2102.09299

And, as I mentioned else-thread, exponential histograms are the best choice in almost all practical situations.


I like how straightforward this one is! It’s fast, it’s obvious, and it’s good enough, if you know something about the data ahead of time. I would have reached for this about four years ago, if I’d known about it.



