
If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning. That means on average you'd only discard 25% of the array after each partition. Your selected pivot is either below the median, above the median, or exactly the median. If it's below the median, it could be anywhere in the range [p0, p50) for an average of around p25; if it's above the median, it could be anywhere in the range (p50, p100] for an average of around p75.

Since these remaining fractions combine multiplicatively, we actually care about the geometric mean of the remaining fraction of the array, which is e^[(integral of ln(x) dx from x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.

Regardless, the total work forms a geometric series, n * (1 + 0.735 + 0.735^2 + ...), which converges to n / (1 - 0.735), or about 3.77n.
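As a quick sanity check of the linear bound (just a sketch, assuming a standard random-pivot quickselect; names and setup are my own), counting how many elements get scanned in total while selecting the median comes out as a small constant times n — typically a bit below the pessimistic 3.77n, since the median doesn't always land in the larger piece:

    import numpy as np

    def quickselect_work(a, k, rng):
        # random-pivot quickselect for the k-th smallest (0-indexed);
        # returns (value, total number of elements scanned)
        a = list(a)
        work = 0
        while True:
            work += len(a)
            pivot = a[rng.integers(len(a))]
            lo = [x for x in a if x < pivot]
            hi = [x for x in a if x > pivot]
            if k < len(lo):
                a = lo
            elif k >= len(a) - len(hi):
                k -= len(a) - len(hi)
                a = hi
            else:
                return pivot, work

    rng = np.random.default_rng(0)
    n = 100_000
    _, work = quickselect_work(rng.random(n), n // 2, rng)
    print(work / n)   # a small constant, typically in the 3-4 range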

Regarding using the average as the pivot: the question is really what quantile would be equal to the mean for your distribution. Heavily skewed distributions would perform pretty badly. It would perform particularly badly on 0.01**np.arange(1, 100) -- at each partitioning step, the mean lands between the first (largest) element and the second, so every partition peels off just one element.
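To illustrate (a rough sketch, not anyone's production code; it assumes a quickselect variant that uses the running mean as the pivot and distinct values), selecting the median from that exponentially decaying array takes on the order of n partitioning steps, versus a handful for a uniform array:

    import numpy as np

    def mean_pivot_steps(a, k):
        # quickselect using the mean as pivot; returns number of partitioning steps
        # (assumes distinct values, otherwise the loop may not make progress)
        a = list(a)
        steps = 0
        while len(a) > 1:
            steps += 1
            pivot = sum(a) / len(a)
            lo = [x for x in a if x <= pivot]
            hi = [x for x in a if x > pivot]
            if k < len(lo):
                a = lo
            else:
                k -= len(lo)
                a = hi
        return steps

    skewed = 0.01 ** np.arange(1, 100)                        # 0.01, 0.0001, 1e-6, ...
    print(mean_pivot_steps(list(skewed), len(skewed) // 2))   # ~n/2 steps
    uniform = np.random.default_rng(0).random(99)
    print(mean_pivot_steps(list(uniform), len(uniform) // 2)) # only ~log2(n) steps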



> If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning.

Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example.

So, your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.

> Heavily skewed distributions would perform pretty badly

That’s why I used the weasel words “real world data” ;-)

I also thought about mentioning that skew can be computed in a streaming fashion (see for example https://www.boost.org/doc/libs/1_53_0/doc/html/accumulators/...), but even if you have that, there are still distributions that will perform badly.
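For reference, the streaming computation amounts to accumulating the first three moments in a single pass. A minimal Python sketch of the idea (not the Boost accumulators API; a Welford-style update would be more numerically stable for long streams):

    def streaming_skewness(stream):
        # one-pass skewness from running sums of x, x^2, x^3 (O(1) memory)
        n = s1 = s2 = s3 = 0.0
        for x in stream:
            n += 1
            s1 += x
            s2 += x * x
            s3 += x * x * x
        mean = s1 / n
        m2 = s2 / n - mean ** 2                            # central 2nd moment (variance)
        m3 = s3 / n - 3 * mean * (s2 / n) + 2 * mean ** 3  # central 3rd moment
        return m3 / m2 ** 1.5

    # a heavily right-skewed stream gives a large positive value
    print(streaming_skewness(0.01 ** i for i in range(1, 100)))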



