
> llvm::SmallVector [...] this class preallocates some amount of data (the actual size of which is configurable via a template parameter) locally on the stack, and only performs a (slow!) heap allocation if the container grows beyond that size. Because malloc is slow, and traversing a pointer to get to your data is slow, the SSO heap storage is a double-win.

But you need to "traverse" a pointer for a stack-allocated array just as well! So it's mainly that in some cases this class helps forgo a malloc, which I am not sure is that much of a win, especially given that this implementation is another class that requires more (and more complicated!) object code...

What I think might be better in many situations is preallocating the dynamic array so that it doesn't need to be preallocated each time the function is called. The preallocated array can be a function parameter (or object member, for OOP weenies :>), or simply a global variable.
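A minimal sketch of that pattern (names are hypothetical, just for illustration): the caller owns a scratch buffer that outlives the call, so repeated invocations reuse the same heap allocation instead of paying for one each time.

```cpp
#include <vector>

// Hypothetical example of caller-provided storage: `scratch` outlives the
// call, so repeated invocations reuse its capacity instead of allocating.
void collect_even(const std::vector<int>& in, std::vector<int>& scratch) {
    scratch.clear();                  // keeps capacity; no deallocation
    for (int v : in)
        if (v % 2 == 0)
            scratch.push_back(v);     // allocates only while growing past capacity
}
```

After the first call, `scratch.capacity()` is large enough, so later calls do no heap allocation at all.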



> But you need to "traverse" a pointer for a stack-allocated array just as well!

By the time you traverse this pointer, its pointed-to memory location is almost certainly already in your L1 CPU cache, since you are accessing it from its parent struct, while a std::vector's data can be in any random place in memory.

In my own code, using SmallVector judiciously makes a drastic performance difference and forgoes an immense number of memory allocations.


Yes, the cache issue is why I was suggesting instead exercising a little more control over where the heap-allocated array is actually located. By giving the array a longer lifetime than the function that uses it, you may already have ensured the thing is cached. Another option could be to require a temporary array argument from the caller. Or (as someone else suggested) delegating that problem to the GC (which I assume might be able to use "young generation" memory).


In my experience, the big benefit from the SSO vector comes from making it the default and making it easy to use. If I emailed my team to ask them to rethink all their short-lived vectors, no one would change anything... but if I say “prefer this SSO vector to the std one,” they can actually adopt that without significantly changing the way they write code. It’s as much a social problem and developer productivity problem as it is about what’s actually best.


I think you could email them to rethink whether short-lived things are in general a good idea with regards to a) performance b) maintainability/overseeability/debuggability/flexibility.

IMHO short-lived by default is a wrong practice, and it's mostly a consequence of the misapplied ideology that everything should have as tiny a lexical scope as possible (e.g. make stack variables where possible). And it's a consequence of OOP thinking, where there isn't really a concept of "memory you can use when you need it" but only "objects that are always constructed". And whenever an object comes into or out of existence, work must be done to make that transition official!

If matters were as simple as using SSO vectors by default (or almost always), then new languages would choose SSO-optimized data structures almost everywhere. But is it actually the case that almost all data fits in arrays of length < 16 or so? I don't think so, and using SSO data structures produces more complicated object code and is slower in the larger cases.


> But you need to "traverse" a pointer for a stack-allocated array just as well!

No. Iterating a small stack array is much faster than iterating a heap array.

By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".


> By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".

No, I got that. But stack memory is main memory as well, and it is accessed through a pointer (in this case the stack pointer) just like heap memory. That is, unless the stack memory is not really stack memory, but optimized by the compiler to be register-only. Which is mostly not possible for stack-allocated arrays (when they are indexed with runtime indices, or when they are simply too large).

It's true that stack memory is (almost) always in the cache, which is why stack memory is considered fast. But couldn't that be mostly true for temporary heap storage as well?


Aside from cachability, there is still an extra indirection. If you have a std::vector on the stack, to access the first element you need to dereference the stack pointer, at an offset, to fetch the heap pointer, and then dereference it. The two loads are in a dependency chain. This is true for the first load at least; for subsequent loads the heap pointer might already be in a register.

For a small vector you still need to dereference the stack pointer to load the heap pointer/discriminant, then perform a conditional jump on it; then, if the vector is stack-allocated, perform a further load at an offset from the stack pointer. Thanks to branch prediction, the second load does not depend on the first, so the dependency chain is shorter, which might make a difference in some scenarios.
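A minimal sketch of the discriminant-based access being described (an assumed layout for illustration, not LLVM's actual one): one load for the discriminant, a well-predicted branch, then a load either from the inline buffer or through the heap pointer.

```cpp
#include <cstddef>

// Assumed small-vector layout: a discriminant selects between an inline
// buffer and a heap pointer. Element access loads the discriminant,
// branches on it, then loads the element.
template <typename T, std::size_t N>
struct SmallVecSketch {
    std::size_t size_ = 0;
    bool on_heap_ = false;   // the discriminant
    T* heap_ = nullptr;      // only valid when on_heap_
    T inline_buf_[N];        // only valid when !on_heap_

    T& operator[](std::size_t i) {
        // The branch is usually well predicted, so the data load is
        // issued speculatively and does not wait on the discriminant load.
        return on_heap_ ? heap_[i] : inline_buf_[i];
    }
};
```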


Honestly the SmallVector case sounds more complicated. As to the dependency chain in the standard vector version, the heap pointer should be easily put in a register. And the SmallVector has such a dependency just as well: the discriminant must be loaded and tested first. In both cases it shouldn't matter much, since it's expected to be optimized to a one-time cost.


Sure, the penalty is usually only paid for the first access to the vector, unless the compiler needs to spill the heap pointer.

The branch predicated on the test is executed speculatively, so the next load does not actually depend on the discriminant load and test; the dependency chain is therefore shorter (one load instead of two). If the code is bottlenecked by something other than dependency-chain length (for example, the number of outstanding loads), it of course doesn't matter, but dependency chains are usually the first bottleneck.


This is insightful. Thanks.


Thanks. Mind you, that's the theory; I haven't actually benchmarked the use case. But in general, reducing the length of pointer chains to traverse is often low-hanging fruit in performance optimization.


I've just noticed that while the thing you mentioned about branch prediction might well often work out, the compiler nevertheless needs to emit additional object code for the discriminant test whenever it cannot infer whether an access is "small" or "large". That might mean a significant increase in code size. Of course, I didn't measure anything either.


It depends on the implementation. Using a discriminant plus branch is one way to do it, but another implementation is to just always use a data pointer, which points into the embedded array in the small vector case.

Now there is no branch and in fact the element access code is identical to std::vector case. Only a few, less frequently called paths need to be aware of the small vs large cases, such as the resize code (which needs to not free the embedded buffer).

The main downsides are that you have added back the indirection in the small vector case (indirection is unconditional), and that you "waste" the 8 bytes for the data pointer in the small case, as the discriminant version can reuse that for vector data.
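A sketch of that always-a-pointer layout (names assumed for illustration; the resize path is omitted): `data_` initially points at the embedded buffer, so element access is a single unconditional indirection, identical to std::vector.

```cpp
#include <cstddef>

// Branch-free small-vector layout: data_ always points at the live storage.
// Only the growth/destruction paths need to distinguish inline from heap
// (e.g. grow() must not free the embedded buffer); access code does not.
template <typename T, std::size_t N>
struct SmallVecPtr {
    T* data_;                 // points at inline_ until the first heap growth
    std::size_t size_ = 0;
    std::size_t cap_ = N;
    T inline_[N];             // the embedded buffer ("wasted" 8-byte pointer
                              // overhead compared to the discriminant version)

    SmallVecPtr() : data_(inline_) {}

    bool is_inline() const { return data_ == inline_; }

    // Identical to std::vector element access: one indirection, no branch.
    T& operator[](std::size_t i) { return data_[i]; }
};
```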

IMO the observed performance improvements are basically 100% due to the avoidance of malloc and in the real world (but not this benchmark) due to better cache friendliness of on-stack data, but not because any vector functions such as element access are substantially sped up. That explains why the benefits disappear (in a relative sense) pretty quickly as the element size increases, even for the large-embedded-buffer case where all of the elements are still on the stack: the malloc benefit is (roughly) a one-time benefit per vector, not a "per access" or "per element" benefit so as the vector grows, the relative benefit diminishes and the non-malloc operation costs come to dominate.


The thing is that a cache line is 64 bytes, and loading from main memory to cache is orders of magnitude slower than from cache to register.

In the std::vector case, it's loading 64 bytes into cache just to get the heap pointer. Then the heap pointer tells it which other 64 bytes to load.

In the SmallVector case, the 64 bytes that contain the heap pointer are likely to include the entire vector, entirely eliminating the second load.


Since it's stack memory, in both cases you can assume that the struct (whether or not it includes a small array) is already loaded in the cache. The question is whether you can't get the vector version's heap memory into the cache as well, if you care even a tiny little bit about data organization.


Sure, it can be the case that both the data and the vector struct are in cache (in different lines). Analyses like "stack is in cache, but heap is in main memory" are too simple to capture the subtleties of real programs. In general, for a benchmark, you can expect everything to be in cache, unless the data is large enough that it doesn't fit, which usually means that it won't fit regardless of small-vector optimization or not.

That said, purely from a memory access and cache use point of view, it is more or less strictly better to pack all the vector data (data and metadata) together as the small vector does: you'll always only bring in the one cache line containing everything. In the split stack + heap case [1], sure, both lines might be in cache, so the access time might be the same, but you still need to have both of those lines in the cache, so the footprint is bigger. So at best it can be "tied", but at some point you'll suffer misses with this strategy that you wouldn't suffer if everything were co-located. It follows directly from the general rule that you want to pack data accessed together close together.

---

[1] Of course it might be heap + heap since the std::vector itself might be allocated on the heap but it doesn't really change the analysis.


When I said main memory I meant failing to get a cache hit.

If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache. They could be if you just read them and then freed them and then the allocator handed them right back. But that’s probably not a good pattern either.

Another form of this pattern is the small string optimization. It’s exceptionally common. https://blogs.msmvps.com/gdicanio/2016/11/17/the-small-strin...
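You can actually observe the small string optimization with a little probe (a hypothetical diagnostic, not a portable guarantee: whether data() points inside the string object is implementation-defined, though it holds for libstdc++, libc++, and MSVC):

```cpp
#include <string>

// Hypothetical SSO probe: a short string's data() points inside the string
// object itself, so no heap allocation was needed. Implementation-defined,
// but true for the major standard libraries.
bool data_is_inline(const std::string& s) {
    const char* p = s.data();
    const char* obj = reinterpret_cast<const char*>(&s);
    // Note: ordered comparison of unrelated pointers is technically
    // unspecified; this is only a diagnostic for common platforms.
    return p >= obj && p < obj + sizeof(s);
}
```

With libstdc++ or libc++, `data_is_inline` returns true for a short string like `"hi"` and false for one of, say, 100 characters.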


> If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache.

Incidentally, this is a great reason why GCs are useful even if you don't need the automatic memory management - a GC can allocate from memory that is always in cache, and they can do it in a couple of cycles.
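The technique behind that "couple of cycles" claim is bump-pointer allocation, which a GC nursery can use because compaction keeps free space contiguous. A minimal sketch (hypothetical names, fixed-size arena for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of bump-pointer allocation: handing out memory is just a pointer
// increment, and the returned bytes are likely still warm in cache because
// the nursery is reused constantly.
struct BumpArena {
    alignas(16) unsigned char buf[1 << 16];  // the "nursery"
    std::size_t off = 0;

    void* alloc(std::size_t n) {
        n = (n + 15) & ~static_cast<std::size_t>(15);  // round up to 16 bytes
        if (off + n > sizeof(buf))
            return nullptr;   // nursery full: a real GC would collect here
        void* p = buf + off;
        off += n;             // the "bump"
        return p;
    }
};
```

A general-purpose malloc can't do this because it must handle frees in arbitrary order; the GC gets away with it by evacuating survivors and resetting `off` wholesale.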



