> Each time the [std::deque] slides, every value in the queue has to change posi...

sudara · on Nov 10, 2023

Thanks for the proofread! I had so much text juggling between this and the README that it was guaranteed some things would fall through the cracks! I updated the things you mentioned about `stackSum` and thanks for the catch on the `sumIn` definition.

> It couldn't provide this if it was moving every value in the queue.

I actually don't remember anymore what std::deque does under the hood, I did look into it, but the only thing I remember is that it was quite slow!

> You'd have to template the code on `radius` instead of passing it in as a runtime parameter so that the compiler could lower the divisions to bitshifts.

Yes, I really like this idea. Especially because radii only really vary between 1-48px for most drop shadow needs. It would be nice to have a handful of the common radii be ripping fast.

klyrs · on Nov 10, 2023

std::deque will never beat your hand-rolled ring buffer. It's a container, it will spend time managing memory every time you walk off the end of a chunk. If it's implemented right*, it will hold onto a chunk which fell off, and will only allocate twice. If it's implemented wrong*, it will allocate after `chunksize` pushes, and free every `chunksize` pops (which might be handled properly by your allocator). But it's still going to need to shuffle those chunks around, adding unnecessary overhead because you're using a ring in a way that is a perfect fit for your application: you use the value you're popping every time you push.

* for your purpose! Therein lies the challenge of writing standard libraries... choices must be made.

vitus · on Nov 10, 2023

> std::deque will never beat your hand-rolled ring buffer.

std::deque may never beat a hand-rolled ring buffer of fixed size that maintains cursors, but it will beat a hand-rolled linked-list implementation.

The typical implementation will generally exhibit better cache locality than a linked list, and will outperform it for random access as required by the specification (amortized constant, vs linear).

(The typical implementation per cppreference, although I don't think this is formally required by the spec, is a sequence of fixed-length arrays.)

klyrs · on Nov 11, 2023

After some thought, I figured out a way to implement a chunked array of the sort specified by the c++ standard library, which does not require reallocation if the queue length stays below the chunksize. It involves storing cyclic offsets for each chunk.

So what I wrote isn't quite correct about needing to fiddle with memory regardless. But there is still overhead -- even if there's well-optimized code for the single chunk case, the container needs to check that condition.

sudara · on Nov 10, 2023

Thanks for this!