
> llvm::SmallVector [...] this class preallocates some amount of data (the actual size of which is configurable via a template parameter) locally on the stack, and only performs a (slow!) heap allocation if the container grows beyond that size. Because malloc is slow, and traversing a pointer to get to your data is slow, the SSO heap storage is a double-win.

But you need to "traverse" a pointer for a stack-allocated array just as well! So it's mainly that in some cases this class helps forgo a malloc, which I am not sure is that much of a win, especially given that this implementation is another class that requires more (and more complicated!) object code...

What I think might be better in many situations is preallocating the dynamic array so that it doesn't need to be preallocated each time the function is called. The preallocated array can be a function parameter (or object member, for OOP weenies :>), or simply a global variable.
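A minimal sketch of that pattern (names are hypothetical, just for illustration): the caller owns a scratch buffer that outlives the call, so repeated invocations reuse the same heap allocation instead of paying for one each time.

```cpp
#include <vector>

// Hypothetical example of caller-provided storage: `scratch` outlives the
// call, so repeated invocations reuse its capacity instead of allocating.
void collect_even(const std::vector<int>& in, std::vector<int>& scratch) {
    scratch.clear();                  // keeps capacity; no deallocation
    for (int v : in)
        if (v % 2 == 0)
            scratch.push_back(v);     // allocates only while growing past capacity
}
```

After the first call, `scratch.capacity()` is large enough, so later calls do no heap allocation at all.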



> But you need to "traverse" a pointer for a stack-allocated array just as well!

By the time you traverse this pointer, its pointed-to memory location is almost certainly already in your L1 CPU cache, since you are accessing it from its parent struct, while a std::vector's data can be in any random place in memory.

In my own code, using SmallVector judiciously makes a drastic performance difference and forgoes an immense number of memory allocations.


Yes, the cache issue is why I was suggesting instead exercising a little more control over where the heap-allocated array is actually located. By giving the array a longer lifetime than the function that uses it, you may already have ensured the thing is cached. Another option could be to require a temporary array argument from the caller. Or (as someone else suggested) delegating that problem to the GC (which I assume might be able to use "young generation" memory).


In my experience, the big benefit from the SSO vector comes from making it the default and making it easy to use. If I emailed my team to ask them to rethink all their short-lived vectors, no one would change anything... but if I say “prefer this SSO vector to the std one,” they can actually adopt that without significantly changing the way they write code. It’s as much a social problem and developer productivity problem as it is about what’s actually best.


I think you could email them to rethink whether short-lived things are in general a good idea with regards to a) performance b) maintainability/overseeability/debuggability/flexibility.

IMHO short-lived by default is a wrong practice, and it's mostly a consequence of the misapplied ideology that everything should have as tiny a lexical scope as possible (e.g. make stack variables where possible). And it's a consequence of OOP thinking, where there isn't really a concept of "memory you can use when you need it" but only "objects that are always constructed". And whenever an object comes into or out of existence, work must be done to make that transition official!

If matters were as simple as using SSO vectors by default (or almost always), then new languages would choose SSO-optimized data structures almost everywhere. But is it actually the case that almost all data fits in arrays of length < 16 or so? I don't think so, and using SSO data structures produces more complicated object code and is slower in the larger cases.


> But you need to "traverse" a pointer for a stack-allocated array just as well!

No. Iterating a small stack array is much faster than iterating a heap array.

By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".


> By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".

No, I got that. But stack memory is main memory as well, and it is accessed through a pointer (in this case the stack pointer) just like heap memory. That is, unless the stack memory is not really stack memory, but optimized by the compiler to be register-only. Which is mostly not possible for stack-allocated arrays (when they are indexed with runtime indices, or when they are simply too large).

It's true that stack memory is (almost) always in the cache, which is why stack memory is considered fast. But couldn't that be mostly true for temporary heap storage as well?


Aside from cachability, there is still an extra indirection. If you have a std::vector on the stack, to access the first element you need to dereference the stack pointer, at an offset, to fetch the heap pointer, and then dereference it. The two loads are in a dependency chain. This is true for the first load at least; for subsequent loads the heap pointer might already be in a register.

For a small vector you still need to dereference the stack pointer to load the heap pointer/discriminant, then perform a conditional jump on it; then, if the vector is stack-allocated, perform a further load at an offset from the stack pointer. Thanks to branch prediction, the second load does not depend on the first, so the dependency chain is shorter, which might make a difference in some scenarios.
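A minimal sketch of the discriminant-based access being described (an assumed layout for illustration, not LLVM's actual one): one load for the discriminant, a well-predicted branch, then a load either from the inline buffer or through the heap pointer.

```cpp
#include <cstddef>

// Assumed small-vector layout: a discriminant selects between an inline
// buffer and a heap pointer. Element access loads the discriminant,
// branches on it, then loads the element.
template <typename T, std::size_t N>
struct SmallVecSketch {
    std::size_t size_ = 0;
    bool on_heap_ = false;   // the discriminant
    T* heap_ = nullptr;      // only valid when on_heap_
    T inline_buf_[N];        // only valid when !on_heap_

    T& operator[](std::size_t i) {
        // The branch is usually well predicted, so the data load is
        // issued speculatively and does not wait on the discriminant load.
        return on_heap_ ? heap_[i] : inline_buf_[i];
    }
};
```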


Honestly the SmallVector case sounds more complicated. As to the dependency chain in the standard vector version, the heap pointer should be easily put in a register. And the SmallVector has such a dependency just as well: the discriminant must be loaded and tested first. In both cases it shouldn't matter much, since it's expected to be optimized to a one-time cost.


Sure, the penalty is usually only paid for the first access to the vector, unless the compiler needs to spill the heap pointer.

The branch predicated on the test is executed speculatively, so the next load does not actually depend on the discriminant load and test; the dependency chain is therefore shorter (one load instead of two). If the code is bottlenecked by something other than dependency-chain length (for example, the number of outstanding loads), it of course doesn't matter, but dependency chains are usually the first bottleneck.


This is insightful. Thanks.


Thanks. Mind you, that's the theory; I haven't actually benchmarked the use case. But in general, reducing the length of pointer chains to traverse is often low-hanging fruit in performance optimization.


I've just noticed that while the thing you mentioned about branch prediction might well often work out, the compiler nevertheless needs to emit additional object code for the discriminant test whenever it cannot infer whether an access is "small" or "large". That might mean a significant increase in code size. Of course, I didn't measure anything either.


It depends on the implementation. Using a discriminant plus branch is one way to do it, but another implementation is to just always use a data pointer, which points into the embedded array in the small vector case.

Now there is no branch and in fact the element access code is identical to std::vector case. Only a few, less frequently called paths need to be aware of the small vs large cases, such as the resize code (which needs to not free the embedded buffer).

The main downsides are that you have added back the indirection in the small vector case (indirection is unconditional), and that you "waste" the 8 bytes for the data pointer in the small case, as the discriminant version can reuse that for vector data.
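A sketch of that always-a-pointer layout (names assumed for illustration; the resize path is omitted): `data_` initially points at the embedded buffer, so element access is a single unconditional indirection, identical to std::vector.

```cpp
#include <cstddef>

// Branch-free small-vector layout: data_ always points at the live storage.
// Only the growth/destruction paths need to distinguish inline from heap
// (e.g. grow() must not free the embedded buffer); access code does not.
template <typename T, std::size_t N>
struct SmallVecPtr {
    T* data_;                 // points at inline_ until the first heap growth
    std::size_t size_ = 0;
    std::size_t cap_ = N;
    T inline_[N];             // the embedded buffer ("wasted" 8-byte pointer
                              // overhead compared to the discriminant version)

    SmallVecPtr() : data_(inline_) {}

    bool is_inline() const { return data_ == inline_; }

    // Identical to std::vector element access: one indirection, no branch.
    T& operator[](std::size_t i) { return data_[i]; }
};
```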

IMO the observed performance improvements are basically 100% due to the avoidance of malloc and in the real world (but not this benchmark) due to better cache friendliness of on-stack data, but not because any vector functions such as element access are substantially sped up. That explains why the benefits disappear (in a relative sense) pretty quickly as the element size increases, even for the large-embedded-buffer case where all of the elements are still on the stack: the malloc benefit is (roughly) a one-time benefit per vector, not a "per access" or "per element" benefit so as the vector grows, the relative benefit diminishes and the non-malloc operation costs come to dominate.


The thing is that a cache line is 64 bytes, and loading from main memory to cache is orders of magnitude slower than from cache to register.

In the std::vector case, it's loading 64 bytes into cache just to get the heap pointer. Then the heap pointer tells it which other 64 bytes to load.

In the SmallVector case, the 64 bytes that contain the heap pointer are likely to include the entire vector, entirely eliminating the second load.


Since it's stack memory, in both cases you can assume that the struct (whether or not it includes a small array) is already loaded in the cache. The question is whether you can't get the vector version's heap memory into the cache as well, if you care even a tiny little bit about data organization.


Sure, it can be the case that both the data and the vector struct are in cache (in different lines). Analyses like "stack is in cache, but heap is in main memory" are too simple to capture the subtleties of real programs. In general, for a benchmark, you can expect everything to be in cache, unless the data is large enough that it doesn't fit, which usually means that it won't fit regardless of small-vector optimization or not.

That said, purely from a memory access and cache use point of view, it is more or less strictly better to pack all the vector data (data and metadata) together as the small vector does: you'll always only bring in the one cache line containing everything. In the split stack + heap case [1], sure, both lines might be in cache, so the access time might be the same, but you still need to have both of those lines in the cache, so the footprint is bigger. So at best it can be "tied", but at some point you'll suffer misses with this strategy that you wouldn't suffer if everything were co-located. It follows directly from the general rule that you want to pack data accessed together close together.

---

[1] Of course it might be heap + heap since the std::vector itself might be allocated on the heap but it doesn't really change the analysis.


When I said main memory I meant failing to get a cache hit.

If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache. They could be if you just read them and then freed them and then the allocator handed them right back. But that’s probably not a good pattern either.

Another form of this pattern is the small string optimization. It’s exceptionally common. https://blogs.msmvps.com/gdicanio/2016/11/17/the-small-strin...
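You can actually observe the small string optimization with a little probe (a hypothetical diagnostic, not a portable guarantee: whether data() points inside the string object is implementation-defined, though it holds for libstdc++, libc++, and MSVC):

```cpp
#include <string>

// Hypothetical SSO probe: a short string's data() points inside the string
// object itself, so no heap allocation was needed. Implementation-defined,
// but true for the major standard libraries.
bool data_is_inline(const std::string& s) {
    const char* p = s.data();
    const char* obj = reinterpret_cast<const char*>(&s);
    // Note: ordered comparison of unrelated pointers is technically
    // unspecified; this is only a diagnostic for common platforms.
    return p >= obj && p < obj + sizeof(s);
}
```

With libstdc++ or libc++, `data_is_inline` returns true for a short string like `"hi"` and false for one of, say, 100 characters.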


> If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache.

Incidentally, this is a great reason why GCs are useful even if you don't need the automatic memory management - a GC can allocate from memory that is always in cache, and they can do it in a couple of cycles.
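The technique behind that "couple of cycles" claim is bump-pointer allocation, which a GC nursery can use because compaction keeps free space contiguous. A minimal sketch (hypothetical names, fixed-size arena for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of bump-pointer allocation: handing out memory is just a pointer
// increment, and the returned bytes are likely still warm in cache because
// the nursery is reused constantly.
struct BumpArena {
    alignas(16) unsigned char buf[1 << 16];  // the "nursery"
    std::size_t off = 0;

    void* alloc(std::size_t n) {
        n = (n + 15) & ~static_cast<std::size_t>(15);  // round up to 16 bytes
        if (off + n > sizeof(buf))
            return nullptr;   // nursery full: a real GC would collect here
        void* p = buf + off;
        off += n;             // the "bump"
        return p;
    }
};
```

A general-purpose malloc can't do this because it must handle frees in arbitrary order; the GC gets away with it by evacuating survivors and resetting `off` wholesale.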



