For what it’s worth, the benchmark on the Zstandard homepage[1] shows none of th...

jhgg · on Sept 20, 2024

One thing to note is that on a given gateway server there are potentially 100k other compression contexts active, and given each connection is transmitting a trickle of small data in an unpredictable way, from different CPU cores as the processes are scheduled by the erlang VM, chances are the CPU caches are absolutely being thrashed. I imagine this contributes to some level of fixed overhead here too, especially when you're measuring these timings on a machine serving actual production traffic as opposed to simply running a bunch of small payloads through a single compressor.

mananaysiempre · on Sept 20, 2024

It’s possible, I guess, but it wouldn’t be my first thought. It’s too slow for that.

A payload of 1.6KB at 45us/B is 75ms, which is below the typical scheduling quantum of about 100ms. (Can’t say anything about Erlang, let alone Erlang bindings to C libraries, but I wouldn’t expect it to be that much smaller either, precisely because of the switching overhead both direct and indirect.) So a single compression operation shouldn’t be getting preempted enough to affect the results.

Typical RAM bandwidth is tens of GB/s (even consumer-class SSDs[1] are single-digit GB/s) so with tens to hundreds of cores that’s not enough to affect anything, and even taking into account the compressor’s window is measured in megabytes not kilobytes that’s likely not enough (it would be a bad compressor that reread its whole window each time, anyway). And the data we’re compressing is not only minuscule, it has just been generated and is virtually guaranteed to be cached.

Honestly, I almost want to say that the benchmark is measuring the wrong thing somehow, except they’re reporting a 2× speedup switching from one compressor to another. So it can’t be the JSON encoding overhead or whatnot, and, unless one of the Erlang bindings is somehow drastically stupider than the other, it shouldn’t be the FFI overhead, and even those are a huge stretch. The Flying Spaghetti Monster be merciful, I cannot see anything here that we could be spending over a hundred million cycles on.

At this point I’m hoping somebody just mixed up the units, because this is really unsettling.

[1] https://lemire.me/en/talk/perfsummit2020/