Shift-to-Middle Array: A Faster Alternative to Std:Deque?

ksherlock · 2025-03-23T23:58:33 1742774313

A couple notes looking at the c++ implementation

- this is going to have problems with non-trivial types. (Think about destructors or move constructors like std::unique_ptr). If you don't want to deal with them, at least add a static_assert(std::is_trivially_copyable<T>::value == true);

- front() doesn't return a reference and it doesn't even return the front

- adding iterators (begin()/end()) will let it play nice with for( : ) loops and <algorithms>, etc.

usefulcat · 2025-03-24T14:47:53 1742827673

Note that this implementation (I looked at the c++ code) will repeatedly double the amount of allocated memory, even if the usage of the queue is such that it never contains more than one item at a time. It's not much different from a memory leak.

    void insert_head(const T& value) {
        if (head == 0) resize(); // <= resize() will double the allocated memory
        data[--head] = value;
    }

It looks like there is a fix for this behavior in resize() (to avoid repeated reallocation when the queue size is small relative to capacity), but it is currently commented out..

severino · 2025-03-24T08:48:38 1742806118

A little off-topic, but is it usual in C++ to have a header (.h) and a source (.cpp) where the attributes and most of the methods are identical in both files, but with some more methods in the header file?

actionfromafar · 2025-03-24T09:04:04 1742807044

Well the code should not be duplicated, only method signatures, but yes.

It’s very common.

Edit: after a while you don't even think about it (and of course, there are reasons for it) but sometimes I pause and think. It didn't have to be this way. Some C++ libraries are what's called "header only" which makes them very easy to integrate into your own code. Downside is that it may take longer to compiler your code. (And here lies a clue to why things are that way. The header tells the compiler how your other code may interface with the code, without having to know exactly what goes on in the code.) There have been attempts¹ to do away with the split between header and code, which would make C++ a bit more like C# or Java for instance in that respect. In newer versions of C++ there is the "module"² concept which I don't know much about but which can achieve something similar.

1: https://sourceforge.net/projects/lazycplusplus/

2: https://en.cppreference.com/w/cpp/language/modules

bogwog · 2025-03-24T14:27:48 1742826468

Modules 100% solve this problem, but broad compiler support is still (frustratingly) lacking.

Some day, a class like in the OP will be implementable in a single file. The compiler will compile that once, and users can 'import' it infinitely without worrying about the usual header inclusion pitfalls, and without incurring any compile time overhead. The amount of electricity saved from the reduced compilation work will save us from the current climate disaster, and we'll become a maximally productive society unburdened by slow C++ compilers.

severino · 2025-03-24T10:50:54 1742813454

Thanks, what you explain in your comment is the idea I had, too, although I've little experience in C++. But I was confused after taking a look at this project's source and seeing all the duplicated code between ShiftToMiddleArray.h and ShiftToMiddleArray.cpp, and not only signatures. I wasn't sure if that was done for some purpose.

freeone3000 · 2025-03-24T13:54:25 1742824465

You need the definitions in the header for templates (which look like what other languages call generic types), because the source of the method needs to be available to effect the template substitution. (The idea behind headers being you can compile against a header and a compiled object, like a shared library.)

mystified5016 · 2025-03-24T13:57:23 1742824643

Typically you declare a member in the header and define it in the CPP. But you can also freely write definitions in your header.

You cannot define the same member twice, tough.

In an ideal universe, your header contains only declarations for functions which are defined elsewhere. If you define something in your header, it should be something intended to be accessed without the CPP. Say, a utility function to give you a string describing an error code.

In reality, because there are no hard rules, people do anything. You get definitions mixed into headers and such.

Look at it this way, each CPP file is intended to be an isolated compiled object. The header defines the ABI you use to talk to that object. And members defined in your header get copied into other CPP files and also compiled there. You want all reusable code to go into a separate compilation so it's not duplicated all over your binary.

sumtechguy · 2025-03-24T14:54:37 1742828077

> In reality, because there are no hard rules, people do anything. You get definitions mixed into headers and such.

All of that is done so forward declaration works.

The problem is #include does just what it says on the tin. It includes whatever is in the file into the file the #include is in. By convention that is .h/.hpp for headers. But there is nothing saying it can not be something like #include<'somerandom.jpg'> or even another .cpp file (seen it). Now that probably will not compile. But the pre-processor will include it at the spot you say. Then promptly barf on it because it does not parse.

The compiler says anything you declare though needs to be defined. Usually a built in type, or class, or struct, or typedef. Basically defined before use. So technically I can glom all of my stuff together and if I get it in the right order I could have one giant file and zero new headers. But we like our class/function files to be semi organized so forward declaring items is the norm.

C++ adds a bit of a twist on all of this. In that a class file does not have to be all in one spot. It can be in a header or smeared across 20 other files. The one rule the linker needs is hey is this declared before you use it. That way the linker can eventually find the right code to call.

To understand the 'why' you have to understand the linker and preprocessor work together to make it happen.

epcoa · 2025-03-24T17:03:39 1742835819

> The compiler says anything you declare though needs to be defined.

No it doesn’t. If that were true dynamic linking would not be a thing, but it’s not even true in the most basic way either.

> Basically defined before use.

Also not true. But this is “not even wrong” since “before use” isn’t defined here, however if it is meant to be appears before in the input, then that is wrong.

> The one rule the linker needs is hey is this declared before you use it. That way the linker can eventually find the right code to call.

Because of the way C++ is defined this is somewhat true (not so much for incomplete types), but also a tortured avoidance of the compiler role (Translation phase 7) in the process, and really the meat of what is going on. I’d recommend someone just read cppreference first.

https://stackoverflow.com/questions/1410563/what-is-the-diff...

actionfromafar · 2025-03-24T14:18:20 1742825900

Just want to add the tiny nitpick that there's no C++ "law" that the end result must be code duplicated in the binary. It's just that it may require link time optimizations and untangling which in practice is not done, so you'd end up with duplicates, after all.

AttilaT · 2025-03-23T23:20:27 1742772027

I recently developed a new data structure called the Shift-To-Middle Array, designed as an alternative to std::deque, std::vector, and linked lists. My goal was to optimize insertion and deletion at both ends, while also improving cache locality and performance compared to traditional implementations.

What is the Shift-To-Middle Array? Unlike std::deque, which uses a fragmented block-based structure, the Shift-To-Middle Array maintains a contiguous memory layout. Instead of shifting elements inefficiently (like std::vector), it dynamically redistributes free space toward the middle, reducing unnecessary data movement.

Key Features: Fast insertions & deletions at both ends (amortized O(1)) Efficient cache utilization (better than linked lists) Supports fast random access (O(1)) No pointer chasing (unlike linked lists) Parallelization & SIMD optimizations possible

Performance Benchmarks I benchmarked Shift-To-Middle Array vs. std::deque vs. ExpandingRingBuffer vs. std::queue across different workloads. Some highlights:

Push-heavy workload → Shift-To-Middle Array showed improved insertion performance over std::deque.

Pop-heavy workload → Showed improvements in memory access and removal operations.

Random insert/remove workloads → Demonstrated better cache efficiency compared to linked lists.

(Full benchmarks and source code available below.)

When Should You Use It? High-performance queue-like structures

Game engines (handling real-time events efficiently)

Networking applications (handling packet buffers)

Dynamic sequences (e.g., computational geometry, physics sims)

Would love to hear thoughts and feedback from the community! Have you encountered similar performance bottlenecks with std::deque or other dynamic structures?

pclmulqdq · 2025-03-24T13:55:24 1742824524

The C++ std::deque usually uses blocks that are big enough not to worry about memory concerns. The linked list has bad performance because you have tiny blocks of memory, but blocks of 256 objects are usually big enough that they are contiguous for all practical purposes (paging, allocation, and caching). Libc++, the library associated with clang and the LLVM project, uses a block of 4096 objects in its deque.

This is an alternative that has been shown in the literature many times before, and it works well for certain access patterns, but is a major waste of resources for others. Yours in particular is great when you are pushing/popping both sides equally. The C++ standard deque is made for unknown but unequal directions of push/pop (while still having ~O(1) random access) with 50/50 ratio of push and pop.

scottlamb · 2025-03-24T15:25:25 1742829925

> I recently developed a new data structure called the Shift-To-Middle Array, designed as an alternative to std::deque, std::vector, and linked lists.

I notice you do not include ring buffers in this headline alternatives list. To me ring buffers seem the most natural comparison. I'd expect them to perform strictly better (no movement, no amortized constant time). But parsing APIs often can't handle the discontinuity where they wrap around, so I think this or something like it has value.

I do see you included this `ExpandingRingBuffer` in your benchmarks, and wrote the following:

> ExpandingRingBuffer performs well for small to medium container sizes but becomes less efficient for larger sizes, where Shift-To-Middle Array and std::deque maintain better performance.

Why do you think ExpandingRingBuffer's performance suffers? Is this about frequent expansion? Otherwise, as mentioned above, I'd expect a well-implemented ring buffer to be hard to beat.

evjohnst · 2025-03-24T06:15:57 1742796957

This is pretty cool, thanks for sharing!

It looks like your benchmarks only cover average operation time. Have you thought about benchmarking p99 latency as well? I would expect insertions that cause reallocations to be slow enough that it might be an issue for some usecases.

MattPalmer1086 · 2025-03-24T09:08:29 1742807309

At the very least, computing the standard deviation as well as the mean average would be useful to get some idea of the variation.

The median, rather than the mean would also generally be a better guide to performance in practice. Even better, give the mean, median and std deviation.

dehrmann · 2025-03-24T00:12:32 1742775152

> Unlike std::deque, which uses a fragmented block-based structure

I always assumed deque implementations were ring buffers that double in size once full so that prepend/append operations are amortized O(1).

tialaramex · 2025-03-24T01:16:26 1742778986

No, that's what Rust's VecDeque is, the C++ std::deque is something which you wouldn't invent today, but alas C++ is wedded to how things were done in the 1990s sometimes before.

std::deque does have practical uses, but they're rare and many implementations aren't well suited even to those uses. Unlike VecDeque most people should just ignore it.

gpderetta · 2025-03-25T09:45:39 1742895939

> the C++ std::deque is something which you wouldn't invent today,

Why is that?

My major issue with std::deque is that the standard implementation doesn't provide a) a way to define the block size, b) a way to access the single blocks for optimizations. The deque in boost.container provides the former. I don't know if any widely available implementation provides the latter (segmented iterators were first proposed around the original C++ standardization, but it seems that were never picked up).

TeMPOraL · 2025-03-24T08:11:29 1742803889

IDK, I thought the whole point of a deque was to string vectors into a linked list, so you get benefits of both, the most important ones being 1) cheap random access, and 2) insertion doesn't move stuff around in memory. I.e. that the deque's "vectors connected by pointers" is not an implementation detail, but the very nature of the beast.

Maybe I just took the boxes-and-arrows diagrams from C++ books too seriously.

widdershins · 2025-03-24T08:39:59 1742805599

Yes, I agree, insertion not moving things is a very useful feature of deques. It allows you to keep items with deleted copy and move constructors in a container that has good cache locality.

amluto · 2025-03-24T00:40:30 1742776830

deque does not move elements once added, so it can’t be implemented like that.

rowanG077 · 2025-03-24T01:14:51 1742778891

Is that an actual guarantee or just the current implementation? I'd expected that at the very least some functions that can operate on std::deque must require moves. erase_if immediately comes to mind.

dzaima · 2025-03-24T02:05:39 1742781939

https://en.cppreference.com/w/cpp/container/deque:

> When inserting at either end of the deque, references are not invalidated by insert and emplace.

> push_front, push_back, emplace_front and emplace_back do not invalidate any references to elements of the deque.

rowanG077 · 2025-03-24T02:10:29 1742782229

That's a guarantee of some specific operations on deque not a general requirement on the entirety of deque. I looked at your link and erase_if does not have that requirement. So I'd imagine that can and does invalidate references aka move elements.

dzaima · 2025-03-24T02:32:01 1742783521

Right, indeed, amluto's comment is incorrectly broad. But it's still a thing that a ring buffer cannot guarantee, but std::deque impls must.

acmj · 2025-03-24T01:21:11 1742779271

std::deque typically uses chunked arrays. It is more complex but tends to be faster than a ring buffer based implementation.

mythmon_ · 2025-03-24T15:20:33 1742829633

That's what I learned in my university data structures course. I don't know anything about C++'s std::deque though.

MattPalmer1086 · 2025-03-24T09:02:48 1742806968

Really interesting, I love new ideas for data structures.

One little note about benchmarking though. It's hard to get good benchmark data, particularly in Java due to the JIT compiler. At the very least, you should perform a large number of warmups (e.g. 10000 calls to the code) before actually benchmarking to ensure all code is fully compiled by the JIT. That's only one of the gotchas though. Even better, use a dedicated Java benchmarking system like jmh.

gpderetta · 2025-03-25T09:49:44 1742896184

There is no JIT in a typical C++ implementation. The first run might be an outlier due to warming up multiple caches, but it can be obviated with a large enough run. Runtime should be relatively stable for synthetic benchmarks.

MattPalmer1086 · 2025-03-25T14:01:56 1742911316

There is a Java implementation as well with benchmarks. But yes, C++ has no JIT, although doing some warm up is also advised for benchmarking in general.

threeducks · 2025-03-24T10:02:31 1742810551

> I recently developed a new data structure

Congratulations, you have discovered the array deque!

https://en.wikipedia.org/wiki/Double-ended_queue#Implementat...

I'd like to point out a performance optimization. On resize, you create a new array with double the size and copy all the old elements to the middle of the new array with more space at the beginning AND at the end: https://github.com/attilatorda/Shift-To-Middle_Array/blob/05...

However, in practice, it is often the case that most operations append elements at either the front OR back, but rarely at both ends equally. For example, imagine a FIFO queue, where elements are popped from the front and pushed from the back. Therefore, it is more efficient to only reserve space at the back if the last operation was push_back or at the front if the last operation was push_front.

One could also carry along statistical information about the number of push_back and push_front operations and balance the space allocated at the front and back accordingly.

In addition to ksherlock's points, there are also the following issues:

- Vector implementations usually use size_t instead of int. Your implementation will fail for arrays larger than 2147483647, while size_t usually goes up to 18446744073709551615.

- On memory allocation failure, you should throw std::bad_alloc instead of calling std::exit.

- The front() function is missing a return statement. Turn on compiler warnings: -Wall -Wextra. For testing, -g -fsanitize=address,undefined is also helpful.

- You are mixing delete[] with malloc'ed memory in shrink_to_fit.

- You should implement copy constructor, move constructor, copy asignment and move assignment functions. See rule of five: https://en.cppreference.com/w/cpp/language/rule_of_three#Rul...

- Switching features on or off is usually done with macros instead of comments.

- 2 is not the best growth factor because it makes it harder for the memory allocator to reuse memory. More modern implementations use smaller growth factors: https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

- A more reasonable initial capacity would be 0 instead of 16, as is common for all major std::vector implementations. An initial capacity of 16 wastes a lot of space when a large number of ShiftToMiddleArrays are allocated.

- The compiler will most-likely ignore your inline instructions and decide on its own whether inlining is done or not, so might as well remove them.

- Why are there two versions of the data structure? (ShiftToMiddleArray.cpp and ShiftToMiddleArray.h)

- Some functions are just wrappers for other functions (pop_front = remove_head, pop_back = remove_tail). Why?

LLMs can point out most of those issue, so you should use them to discover potential issues. Of course, make sure to double-check with more reliable sources once you know that a certain class of problems exists.

fearthetelomere · 2025-03-24T16:58:55 1742835535

>- 2 is not the best growth factor because it makes it harder for the memory allocator to reuse memory. More modern implementations use smaller growth factors: https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor

I've often wondered about this, so I'm curious to learn more. I agree in principle we should be more clever to ensure we have better memory use. However, half the implementations in the list you've linked use a growth factor of 2, so I'm confused about your point.

If it's not the best, what is? Do you know why these implementations opt for 2 if it's not the best choice?

SkiFire13 · 2025-03-24T19:45:14 1742845514

The better memory usage argument is not as one sided as it may appear. The issue related to using a growth factor of 2 appear when you have a single very big array, but for many smaller ones it's not really an issue. Meanwhile allocators generally tend to like power of 2 sizes, and non-2 growth factors produce non-power of 2 sizes. Some data structures are also easier to implement when the size is a power of 2 because you can efficiently wrap around by using bitwise operations rather than an expensive remainer.

masklinn · 2025-03-25T09:30:52 1742895052

> The issue related to using a growth factor of 2 appear when you have a single very big array

And a linear coalescing allocator of which that collection is essentially the only user.

That is, the allocator needs to be a single bytes array, and it needs to be able to reuse and merge freed allocations, and there can’t be other objects being allocated between your collection’s allocations (or they need to be freed before your collection needs to realloc).

threeducks · 2025-03-24T17:52:49 1742838769

> Do you know why these implementations opt for 2 if it's not the best choice?

Probably a mix of simplicity, not knowing or not caring. Most software is not optimal.

> If it's not the best, what is?

In theory, the best value is a bit less than the golden ratio, so 1.5 is quite good.

https://archive.li/Z2R8w#selection-119.7-135.119

In practice, unknown factors can influence the result, so it is best to benchmark your code and try a bunch of values until you find the fastest configuration.

dzaima · 2025-03-24T23:52:20 1742860340

That theoretical result of the golden ratio is completely entirely useless for any practical scenario, and, even for its extremely-contrived scenario where the only allocations ever happening are from a single array being expanded, and using an extremely dumb allocator, it's just optimizing memory usage, not performance.

Amusingly, it looks at Java, which typically uses pointer bumping for the allocator and can do compacting GC, making the argument entirely meaningless.

gpderetta · 2025-03-25T09:52:40 1742896360

IIRC libstdc++ benchmarked the golden ratio and found it worse for the loads they were interested in, so they stuck with a growth factor of 2.

qwe----3 · 2025-03-24T10:11:35 1742811095

- The compiler will most-likely ignore your inline instructions and decide on its own whether inlining is done or not, so might as well remove them.

There are compiler specific attributes that can really force this. Of course, it's worth doing benchmarks and looking at the generated assembly to see if this is necesary

ntonozzi · 2025-03-23T23:58:01 1742774281

Sounds very cool! How do you implement efficient random deletes?

oasisaimlessly · 2025-03-24T04:40:44 1742791244

std::vector doesn't support efficient random deletes, so this (a generalization of std::vector along one specific axis) won't either.

Efficient random deletes and contiguous hole-free storage are completely at odds with each other.

sensen7 · 2025-03-24T12:11:47 1742818307

They aren't: https://github.com/GossiperLoturot/densemap

oasisaimlessly · 2025-03-24T13:49:24 1742824164

You can't refer to this collection via a slice (std::span / &[T]), so it's not what I meant when I said "contiguous hole-free storage".

wffurr · 2025-03-24T12:43:06 1742820186

That’s a map, not an indexed array, and it’s slower than Vec across the board per its own benchmarks.

sensen7 · 2025-03-24T12:52:51 1742820771

Yes, it's a map that provides efficient random deletes and contiguous hole-free storage, proving that they are not completely at odds with each other.

If you don't care about the map part, you can get the same behavior by just moving the last element in the place of the newly removed element. This invalidates all indices, which is what the DenseMap's overhead is meant to avoid, but Vec's remove also invalidates indices. Vec's remove is strictly worse except that it preserves the ordering if the Vec is sorted.

eru · 2025-03-24T02:54:29 1742784869

The current implementation doesn't.

You could add that functionality via 'tombstones': when you delete an element, you replace it with a 'tombstone' marker in the structure.

Whenever you clean up the structure (eg for a resize), you skip the tombstones when you copy the old contents over.

dzaima · 2025-03-24T02:56:29 1742784989

That'd break arbitrary reads being contiguous O(1) though.

eru · 2025-03-24T13:54:02 1742824442

Yes, that's true. Though you could still access front and back in amortised O(1).

bobmcnamara · 2025-03-24T11:33:58 1742816038

There's a similar data structure sometimes seen in video games that does this, but the name escapes me.

sensen7 · 2025-03-24T12:54:30 1742820870

You're probably thinking of "slabs" or "slotmaps"

pklausler · 2025-03-24T16:01:50 1742832110

This is like the opposite of a gap buffer.

yencabulator · 2025-03-25T19:43:01 1742931781

A degap buffer.

queue : deque :: gap : degap

(Use a ringbuffer, instead.)

rwbt · 2025-03-24T00:07:27 1742774847

AFAIK, Apple's CoreFoundation CFArray also works similarly[0]. NSMutableArray works little differently (using a circular buffer). From the always excellent Cichenowski[1].

[0] - https://github.com/opensource-apple/CF/blob/master/CFArray.c [1] - https://ciechanow.ski/exposing-nsmutablearray/

pizlonator · 2025-03-24T16:41:05 1742834465

JavaScriptCore uses this technique for JavaScript arrays that are used as deques, like if you unshift/shift.

Here's the core data structure: https://github.com/WebKit/WebKit/blob/main/Source/JavaScript...

The logic that makes it work is in JSArray.cpp and other files in that directory.

Shift-to-middle is surprisingly performant and also surprisingly hard to get right.

theamk · 2025-03-24T19:02:28 1742842948

I don't see why I would use it over `std::deque`. It has all the same complexity properties, but better tested, included in stdlib, and supports complex objects. It even has OK cache locality, given it allocates data in large-ish blocks.

You should really include this in you summary table, because it'd have all the same values compared to shift-to-middle array.

And your benchmark confirm this: figure 3 does not show the raw data, but it looks like std::queue may be 8-10% slower on smaller data sizes, and 1-2% slower on larger data sizes. Such small and inconsistent differences do not indicate different O()-complexity, and likely very dependent on specific benchmark design.

Related: std::deque implementation details for various compilers: https://devblogs.microsoft.com/oldnewthing/20230810-00/?p=10...

jzwinck · 2025-03-24T20:55:32 1742849732

You say deque uses large-ish blocks but you provide documentation that it uses 512 byte blocks on GCC and MSVC is even worse. So if you're on Windows the blocks are so small the container degenerates into something like a std::list, and on non-Windows it only works well if your objects are a few bytes each.

theamk · 2025-03-25T00:19:34 1742861974

The gcc's 512 byte block is actually not that bad.

For cache locality, you want block size that is bigger that cache line size - and those are 64 to 128 byte range. And as for overhead, it seems like 1 pointer per block, or 1.5% for deque of pointers/integers - not a very big value. So yeah, while linked lists are bad, gcc's deque is pretty OK.

For MSVC, I agree, their std::deque is pretty bad. But the title of the post isn't "a faster deque MSVC", it makes no distinction between OSes.

pcwalton · 2025-03-24T00:18:46 1742775526

Interesting alternative idea I thought of just now: a data structure that works like VecDeque (a circular buffer) but uses mmap to map two views onto the same pages right after one another. That would ensure that the entire array can be accessed in a consecutive fashion, no matter where it gets split, without any copying. The downside is that reallocation would be really slow, involving multiple syscalls, and the minimum size of the array would be 4kB or more, depending on the page size.

duped · 2025-03-24T00:44:16 1742777056

I've seen this trick used around, where it really shines is when you want to prepare/commit a range of the ring buffer when interfacing with something that wants a contiguous chunk as an arg, using the mmap hack lets you pass any pointer into the ring buffer without needing to split it to handle the wraparound case.

There are a few blog posts out there about it, eg https://lo.calho.st/posts/black-magic-buffer/. One data structure that works around the limitations is the bip buffer: https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The.... In that article the author talks about the mmap trick.

scottlamb · 2025-03-24T14:27:08 1742826428

I don't think the bip buffer solves a real problem. Let's say I'm using it as a read buffer.

> The upshot of all of this is that on average, the buffer always has the maximal amount of free space available to be used, while not requiring any data copying or reallocation to free up space at the end of the buffer. ... Another possibility which was brought up in the bulletin board (and the person who brought it up shall remain nameless, if just because they... erm... are nameless) was that of just splitting the calls across wraps. Well, this is one way of working around the wrapping problem, but it has the unfortunate side-effect that as your buffer fills, the amount of free space which you pass out to any calls always decreases to 1 byte at the minimum - even if you've got another 128kb of free space at the beginning of your buffer, at the end of it, you're still going to have to deal with ever shrinking block sizes.

So it maximizes the contiguous free bytes. I feel like the author just never knew about readv? Passing a couple iovecs completely solves this problem in a much better way.

What seems far more valuable for the used space to be contiguous, as parsing APIs often expect this. bip buffers don't offer that, right?

Now let's say I'm using it as a write buffer. I've never had the problem of needing it to be contiguous on either side. On the input side, I could imagine some application API that really wants to write into a contiguous buffer, but it hasn't been my experience. On the output side, there's writev.

pcwalton · 2025-03-24T03:16:04 1742786164

Oh, I see, it's actually on Wikipedia [1]. I figured I wasn't the first one to invent the idea :)

[1]: https://en.wikipedia.org/wiki/Circular_buffer#Optimization

o11c · 2025-03-24T01:50:14 1742781014

For C++, that's only valid for some subset of types, which currently can't be expressed with type traits. "Address-free" has a close enough definition in the context of atomics.

Trivially moveable types are probably sufficient (at least, I can't construct a case where being trivially copyable is needed), but not necessary; there are many things a special member function can do without caring about the address.

In practice, the main problem is that you can't use private mappings (which are the default and for good reason); you have to use shared mapping, which are very finicky to set up and cause infelicities with `fork`. [This does make me wonder how reflinks/`copy_file_range` interact with `mmap` and the page cache.]

Really, you should just fix all your APIs to take an `iovec` array.

gpderetta · 2025-03-25T10:35:05 1742898905

On linux you can use remap_file_pages[1] to setup your magic ring buffer. Yes, these days it is emulated in the kernel, but here we only care about setting up the expected duplicated mapping of anon pages instead of minimizing the allocations of VMAs.

Also linux has MADV_DONTFORK, would that allay your concerns? (there is a potential race between mmap and the madvice call, but if you are forking in a multithreaded program you have worse concerns).

[1] https://man7.org/linux/man-pages/man2/remap_file_pages.2.htm...

HeliumHydride · 2025-03-24T03:41:20 1742787680

C++26 is getting a trait called "trivial relocatability", which allows you to communicate that a type can be memcpy'd around without fear.

eru · 2025-03-24T02:55:33 1742784933

> In practice, the main problem is that you can't use private mappings (which are the default and for good reason); you have to use shared mapping, which are very finicky to set up and cause infelicities with `fork`. [This does make me wonder how reflinks/`copy_file_range` interact with `mmap` and the page cache.]

Yet another reason fork was never a good design choice.

fc417fc802 · 2025-03-24T08:53:46 1742806426

Is fork really the problem in this scenario?

eru · 2025-03-25T03:43:52 1742874232

I don't think it's the main problem here, but it's just another point where the fork/exec model shows its problems.

fc417fc802 · 2025-03-25T14:30:09 1742913009

What would you replace fork with? I don't get it.

Isn't the desired improvement here to how the kernel handles the lifetime of mappings? It seems orthogonal to me.

yxhuvud · 2025-03-24T07:11:59 1742800319

So essentially pushing the work to the TLB? Well, if there are people building moving GCs that manage to retain stable pointers that way, why not use it for a circular buffer as well.

> The downside is that reallocation would be really slow, involving multiple syscalls,

Scaling ring buffers up and down in size is not very performant anyhow, as a bunch of the elements in it tend to need to be copied.

gpderetta · 2025-03-25T10:36:58 1742899018

This is normally called a magic ring buffer, and it is a relatively well known pattern. Very useful for queues when you need to support variable size elements and need to interoperate with code that can't handle split objects.

leiroigh · 2025-03-24T12:31:35 1742819495

The main problem with that is that it doesn't play nice with most languages. Consider

  int foo(int* ptr) {
    int x = ptr[1<<16];
    *ptr += 1;
    return x + ptr[1<<16];
  }

Compilers/languages/specs tend to decide that `ptr` and `ptr + (1<<16)` cannot alias, and this can be compiled into e.g.

  foo(int*):
        mov     eax, dword ptr [rdi + 262144]
        inc     dword ptr [rdi]
        add     eax, eax
        ret

which gives undesired results if `ptr` and `ptr + (1<<16)` happen to be mapped to the same physical address. This is also pretty shit to debug/test -- some day, somebody will enable LTO for an easy performance win on release builds, and bad code with a security vuln gets shipped.

scottlamb · 2025-03-24T15:42:57 1742830977

I don't think that's a fundamental problem. In say Rust (with its famously strict aliasing requirements), you obviously need some level of unsafe. You certainly want to ensure you don't hand out `&mut [T]` references that alias each other or any `&[T]` references according to either virtual or physical addresses, but that seems totally possible. I would represent the ring buffer with a raw pointer and length. Then for callers I'd construct `&[T]` and `&mut [T]` regions as needed that are never more than the full (unmirrored) length and thus never include the same byte twice. There are several existing Rust crates for the mirrored buffer that (though I haven't looked into their implementations recently to verify) presumably do this: slice-deque, vmcircbuf, magic-ring-buffer, vmap.

I do think though there are some downsides to this approach that may or may not be deal-breakers:

* Platform dependence. Each of the crates I mention has a fair bit of platform-specific `unsafe` code that only supports userspace on a few fixed OSs. They fundamentally can't work on microcontrollers with no MMU; I don't think WASM has this kind of flexibility either.

* Either setting up each buffer is a bit expensive (several system calls + faulting each page) or you have to do some free-listing on your own to mitigate. You can't just rely on the standard memory allocator to do it for you. Coincidentally just like last week I was saying freelisting is super easy for video frames where you have a nice bound on number of things in the list and a fixed size, but if you're freelisting these at the library level or something you might need to be more general.

* Buffer size constraints. Needs to be a multiple of the page size; some applications might want smaller buffers.

* Relatedly, extra TLB pressure, which is significant in many applications' performance. Not just because you have the same region mapped twice. Also that the buffer size constraints mentioned above make it likely you won't use huge pages, so on e.g. x86-64 you might use 4 KiB pages rather than 2 MiB (additional factor of 512x) or 1 GiB (additional factor of 262144x) as the memory allocator would help you do if they could be stuffed into the same huge page as other allocations.

dzaima · 2025-03-25T06:45:32 1742885132

Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up. (and yes it can do so moving ops out of language-level lifetimes as the compiler knows that that should be safe)

vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds. Obvious issue, but, as my example never uses multiple references with overlapping lifetimes of any form, the issue would not be fixed by any form of more proper reference exposing; it's just simply the general issue of referencing to the same data in multiple ways.

vmap afaict only exposes push-back and pop-front for mutation, so unfortunately I think the distance to cross to achieve spooky action in practice is too far (need to do a whole lap around the buffer to write to the same byte twice; and critical methods aren't inlined so nothing to get the optimizer to mess with), but it still should technically be UB.

slice_deque has many open issues about unsoundness. magic-ring-buffer doesn't build on modern rust.

[1]: https://dzaima.github.io/paste/#0TVDBTsQgFLz3K56XbptsWlo1MWz...

scottlamb · 2025-03-25T16:02:27 1742918547

> Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up.

Hmm, as I think about it, I see your point about LLVM's optimizer potentially "knowing" memory hasn't changed that really has if it inlines enough even if it's never put into the same &mut [T] as the other side of the mirror (and two improperly aliased &mut [T] are never constructed).

But as an alternative to doing all the stores in a special way (and loads...don't see how doing a volatile store to one side of the mirror is even sufficient to tell it the other side of the mirror has changed)...it'd be far more practical if the caller could use a (not mirrored) &mut [T]. Couldn't you have an std::ops::IndexMut wrapper that returns a guard that has a DerefMut into &mut [T] and on Drop creates a barrier for these kinds of optimizations via `std::arch::asm!("")`? [1] Then LLVM has to assume all memory changed in that barrier.

Regarding the more specific crate issues: I found these crates a while ago and hadn't looked extensively in their implementation. Thanks for pointing these out; I will have to look more closely if/when I ever decide to actually use this approach. I was leaning toward no anyway because of the other factors I mentioned. As an alternative, I was thinking of having a ring buffer + a little extra bit at the end that is explicitly copied from the start as needed. The maximum length of one message I need a contiguous view of is far less than the total buffer size, so only a fraction of the buffer would need to be copied.

> vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds.

Yuck, noted, clearly wrong to give the whole thing as a `&mut [T]`.

> slice_deque has many open issues about unsoundness.

I see at least couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

[1] https://stackoverflow.com/questions/72823056/how-to-build-a-...

dzaima · 2025-03-25T17:05:08 1742922308

Yeah, an asm marked as memory-clobbering is the proper thing; not the first time I've forgotten that volatile entirely doesn't imply anything to other memory. (in fact, doing "((volatile uint8_t*)x)[0] = 0xaa;" in my godbolt link in a sibling thread still has the optimization happen). Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it isn't? or, I guess, not have any references live during the asm call is the proper thing.

Probably indeed possible to do it with proper guards (the pre-pooping your pants issue is probably not a problem if you also have the asm guard in drop?).

> I see at least couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

Yeah, possible. I was just saying that from the perspective of proving that all the ring buffers not taking extreme care are incorrectly implemented.

scottlamb · 2025-03-25T17:49:04 1742924944

> Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it isn't? or, I guess, not have any references live during the asm call is the proper thing.

I don't know either, but really it's the opposite half of the buffer you want to tell it may have changed, so I imagine it doesn't matter even if you still have the `&mut [T]` live.

Maybe the extra guard I described isn't necessary either; the DerefMut could directly return `&mut [T]` but set a `barrier_before_next_access` on the ring, or you could just always have the barrier, whatever performs best I guess.

leiroigh · 2025-03-25T12:33:50 1742906030

>so unfortunately

I see a fellow enjoyer of bugs ;)

>vmap afaict only exposes push-back and pop-front for mutation

what about https://doc.rust-lang.org/nightly/std/io/trait.Write.html#ty... ?

>and critical methods aren't inlined

aren't inlined explicitly. This does not mean that they are not inlined in practice (depending on build options). Also, LLVM can look inside a noinline available method body for alias analysis :(

This is a big pain whenever one wants to do formally-UB shennenigans. I'm not a rustacean, but in julia a @noinline directive will simply tell LLVM not to inline, but won't hide the method body from LLVM's alias analysis. For that, one needs to do something similar to dynamic linking, with the implied performance impact (the equivalent of non-LTO static linking doesn't exist in julia).

dzaima · 2025-03-25T13:41:29 1742910089

> I see a fellow enjoyer of bugs ;)

Yep! :)

I did look at the assembly on a release build and the write method was in fact not inlined (needed to get the compiler to reason about the offset aliasing); that write method is what I called "push-back" there. I could've modified the crate to force-inline, but that's, like, effort, just to make a trivially-true assertion for one HN post.

Indeed a lack of an equivalent of gcc's __attribute__((noipa)) is rather annoying with clang (there are like at least 4 issues and 2 lengthy discussions around llvm, plus one person a week ago having asked about it in the llvm discord, but so far nothing has happened); another obvious problem being trying to do benchmarking.

(for reference, what I was trying to get to happen was an equivalent of https://godbolt.org/z/jobs6M95G)

dzaima · 2025-03-25T05:17:42 1742879862

Volatile stores would fix that issue. But it does mean that it'd be unsafe to lend out mutable references to objects. (maybe you'd need to do volatile loads too, depending on model of volatility)

dzaima · 2025-03-24T03:01:50 1742785310

The ExpandingRingBuffer.h is rather bad for a representation of a ring buffer - it uses modulo for the index calculation, which is pretty damn bad and should really at the very least be masking by a power-of-two.

(it might very-slightly-but-not-really be excusable if the reason was memory utilization.. ..but the resize does "size_t new_capacity = capacity * 2;" so it does doubling anyway. Also, see reply noting that you don't even need power-of-two sizes for fast wrapping, which I managed to completely forget about)

kragen · 2025-03-24T03:21:56 1742786516

Most ring buffers that aren't powers of 2 in size can still get by with i == max ? 0 : i+1 and i ? i-1 : max. On most hardware these will be almost as fast as just i+1 and i-1, while division will be much slower, even on recent hardware.

dzaima · 2025-03-24T03:25:12 1742786712

..yep, true, completely slipped my mind. So modulo is just trivially always the bad choice.

Even for arbitrary indexing "tmp=head+index; buffer[(tmp >= capacity) ? tmp - capacity : tmp]" or so is gonna be better. (assuming the compiler compiles it to something branchless. Or, probably even if it doesn't - division might end up slower than the worst-case of 50% misprediction! And for sequential indexing it's even gonna be predictable.)

jeffzha · 2025-03-23T23:59:07 1742774347

How does this compare to boost::devector?

https://www.boost.org/doc/libs/develop/doc/html/container/no...

TylerGlaiel · 2025-03-24T07:45:37 1742802337

This implementation grows indefinitely if you repeatedly push to the head and remove from the tail, even if the max number of elements in the array is small

IshKebab · 2025-03-24T09:34:39 1742808879

Does it definitely do that? You could easily avoid it by making the "resize" really a move if you don't actually need more space.

I feel like they're over-selling it anyway by comparing to `std::deque` (which is not hard to beat). The only advantage this has over a standard ring buffer (like Rust's VecDeque) is that the data is completely contiguous, but you'll pay a small performance cost for that (regular memmove's when used as a queue), and I'm not sure how useful it is anyway.

bogdan-lab · 2025-03-24T07:13:14 1742800394

The main benefit list gives you comparing to vector is a stable memory. This is often an important property. But lists are slow with an element access. This problem is solved by deque.

Therefore, I argue that alternative to deque has to have a stable memory property. Otherwise, you can just use a vector.

This implementation is trying to do so, btw, but for some reasons it operates with a raw memory under the hood instead of just holding a vector and rotating it here and there. Such approach is unnecessary complicated and error-prone

jtrueb · 2025-03-24T19:04:40 1742843080

Since I often use a sliding window, I made a Rust implementation and was surprised with how performant the VecDeque implementation was.

The MidVec was only faster than VecDeque when doing batch inserts and removals with my implementation.

https://gist.github.com/trueb2/9c0a23aa012f56d4c3d50afe8acf6...

kragen · 2025-03-23T23:58:18 1742774298

Interesting! This is the kind of thing I like.

I'm having a hard time understanding the description. If I understand right, it's kind of like an inside-out gap buffer, or a hybrid of a gap buffer and a ring buffer? Is the free space in the array always contiguous? If not, is the non-free space? How is it different from ExpandingRingBuffer?

Arnavion · 2025-03-24T00:14:46 1742775286

It starts by adding the first element to the middle of the allocation and then the head and tail grow outwards as more elements are added at either end.

Once the head reaches the front of the allocation or the tail reaches the rear of the allocation, it triggers a resize. The resize creates a new allocation with double the size and copies the original elements to the middle of this new allocation.

kragen · 2025-03-24T00:18:41 1742775521

That can't be correct, because then just adding elements at the front and removing them at the rear while maintaining a constant queue size such as 5 elements would trigger an infinite number of resizes. Maybe it only resizes under some circumstances, otherwise copying the live elements back to the middle?

It still seems like that involves copying that a straightforward ring buffer avoids.

akoboldfrying · 2025-03-24T00:56:29 1742777789

I think the "resizes" occurring in this scenario would be to same-size buffers, meaning that the very same buffer could actually be reused. Provided that's in fact what is happening, then yes, this scenario would cause infinite resizes, all of which would be avoided by a ring buffer. But those resizes would still happen rarely enough to meet the amortised complexity claim, namely, 1 resize (copying of n elements) per n elements inserted.

(I'm not at all certain that this is how it actually does work -- the README is light on details. But this is how it might work.)

kragen · 2025-03-24T01:36:39 1742780199

You don't want to always resize to same-size buffers in that situation; consider a size-1024 buffer containing 1023 elements and a free space on the right. Pushing another item on the left can be done in the very same buffer, but requires copying all 1023 items one space to the right. If you then pop another item off the right end you are back to the starting state, so by repeating the process you need 511.5 element copies (in general, O(N) copies) per push or pop.

There are obvious ways to resolve problems like this, but there are tradeoffs among them, and I would like to know which way the author chose and what the resulting complexity is without having to analyze (and debug) 270 lines of C++.

Hmm, there's a PDF at https://github.com/attilatorda/Shift-To-Middle_Array/blob/ma...... but it also doesn't explain things like this. It makes assertions about big-O performance, but doesn't explain the algorithm in enough detail to know whether they are correct.

senderista · 2025-03-24T04:32:29 1742790749

I implemented the "2-level rotated array" structure a few years ago, which isn't designed to be used as a deque, but is at least less pathological than this (O(1) push_back/pop_back, O(sqrt(n) push_front/pop_front/insert/delete).

https://github.com/senderista/rotated-vec

kragen · 2025-03-24T16:14:52 1742832892

Oh, fantastic!

Arnavion · 2025-03-24T01:44:45 1742780685

I read the code before I made my comment. It does exactly what I described.

kragen · 2025-03-24T01:50:30 1742781030

That sounds like pathologically bad behavior, using an amortized-linearly growing amount of memory to hold a constant amount of data? (Moreover, this is perhaps the most common use case for a queue.) It may be a correct description of the code, but it can't be a correct algorithm.

Arnavion · 2025-03-24T02:00:26 1742781626

orlp · 2025-03-23T23:37:40 1742773060

I made something similar to this ~10 years ago: https://github.com/orlp/devector. I never finished it (writing proper containers in C++ is a nightmare [1] [2] [3]), although I did start a similar project in Rust a year or two ago... which I also haven't finished yet (the repo is still private). The double-ended vector is very similar to a regular vector, it can just have free space on both ends:

    <------------ cap_front ------------>
                      <------------ cap_back ------------>
    <-----------------  total_capacity  ----------------->
                      <-----  len  ----->
    <-- space_front -->                 <-- space_back -->
    [                 [    elements     ]                ]
                      ^
                      +--- ptr

In the Rust crate I store 1 pointer and three lengths: len, space_front, space_back for a total size of 32 bytes compared to the usual 24 bytes of Vec.

---

I don't think you always want to shift to the middle. Rather, I propose the following strategy (which I do in the Rust crate, unsure if I did the same in C++ implementation):

1. When a request is made for more free space on one side, check if there is already enough free space, and if not,

2. Compute an amortized growing capacity (e.g. double the current capacity), and take the maximum of that with the requested capacity. While doing this ensure you only take into account the capacity of the side you want more space on (e.g. cap_back in the above picture when growing the back),

3. Check if halving the free space on the other side is sufficient to satisfy the amortized request, if yes, do not reallocate and just shift the values internally, otherwise,

4. Allocate a new buffer with the computed capacity, plus the same amount of free space on the other side and copy over the values.

The above strategy ensures you will not exceed 3N space (with doubling space on grow) even when the double-ended vector is used in a LIFO pattern. For example a regular Vec which doubles its size has a 2N total space worst-case.

[1] https://stackoverflow.com/questions/26902006/may-the-element... [2] https://stackoverflow.com/questions/27453230/is-there-any-wa... [3] https://stackoverflow.com/questions/26744589/what-is-a-prope...

beached_whale · 2025-03-24T01:02:37 1742778157

Boost has a double ended vector with that name too. Devector

nurettin · 2025-03-24T05:26:23 1742793983

Try to turn everything into arrays. Maps, hashmaps are convenient, but if possible, sort your data and use parallel arrays. Deques of fixed size turn into ring buffers. At work we have invented several data structures over the years with weird names and they all make use of some trick to shave off memory allocation times when working with time series.

delifue · 2025-03-24T02:43:05 1742784185

If I keep removing one element in front and adding one element on back, then normal ring-buffer deque will involve no copying, but this will keep doing copying to empty space, so its performance could be much worse than deque if the queue is large.

alextingle · 2025-03-24T05:32:07 1742794327

I love how the descriptive text specifically calls it an alternative to a deque, but the complexity comparison pointedly does not compare it to a deque.

hoseja · 2025-03-24T08:31:13 1742805073

I really hate how MSVC made std::deque toxic for cross-platform use, it's such a cool concept.

kistaro · 2025-03-24T00:49:57 1742777397

Isn’t this already implemented by NSMutableArray in Apple SDKs/libraries?

taco9999 · 2025-03-23T23:44:22 1742773462

What benefits does this have over a standard VecDeque?

orlp · 2025-03-23T23:53:54 1742774034

The elements are completely contiguous, which can be nice for passing off (subslices) to other APIs, maximum speed iteration, etc.

ufo · 2025-03-23T23:57:50 1742774270

Does it have to move or resize when one of the sides reaches the end of the array? I presume that would be slower than a ring buffer that only grows when it's completely filled?

manwe150 · 2025-03-24T00:04:31 1742774671

Both are O(1) datastructures, but indexing a ring buffer is slightly more costly compared to this and insertion is slightly more costly for this than a ring. Probably usually works out in favor of this design though for net performance usually?

ufo · 2025-03-24T00:11:38 1742775098

I'd love to see performance numbers for this, if they're available. My hunch is that indexing cost would be about the same.

manwe150 · 2025-03-24T02:05:52 1742781952

They both have an offset, but ring buffers aren’t contiguous so they also need a branch or modulus to handle wrap around. Either can be cheap, but clearly that is strictly more costly than not having the extra operation (even if very little). Only matters for random indexing also, since for mutation the situation is swapped

ufo · 2025-03-24T02:16:10 1742782570

There are many situations where those little differences completely vanish because of instruction pipelining. Only way to know is to actually measure it.

Arnavion · 2025-03-24T00:15:39 1742775339

>Does it have to move or resize when one of the sides reaches the end of the array?

Yes, it resizes when that happens, to double the size.

ufo · 2025-03-23T23:55:50 1742774150

For context: a VecDeque is a ring buffer backed by an array.

pezezin · 2025-03-24T00:31:46 1742776306

You should call it Middle-Out Array for maximum Internet points /jk

boguscoder · 2025-03-23T23:54:04 1742774044

is it just me or benchmarks report link is dead and hence there's no way to see the comparison

codetrotter · 2025-03-24T00:04:45 1742774685

It’s not dead as such but it looks like OP may have left over some links they intended to add.

That one and at least one other simply links to #, which means same page no anchor. And this is commonly done as a placeholder link before you have the link in place you intended to put there.

Whereas a dead link for me would be one that leads elsewhere and results in 404 (page moved, file not yet created, etc) or an expired or not yet registered domain.

But I get what you mean, and I agree OP should update those links to point somewhere :)

nbonaparte · 2025-03-24T01:00:38 1742778038

There’s a PDF with the results in the repo: https://github.com/attilatorda/Shift-To-Middle_Array/blob/ma...

BTW, if OP is reading this, I recommend having the baseline in your plots (e.g. std::deque) as the relative 100%, that way the performance improvement is clear.

NotYourLawyer · 2025-03-24T18:40:26 1742841626

Middle out?