Honestly, the SmallVector case sounds more complicated. As to the dependency chain in the standard vector version, the heap pointer should easily end up in a register. And the SmallVector has such a dependency just as well: the discriminant must be loaded and tested first. In both cases it shouldn't matter much, since it's expected to be optimized down to a one-time cost.
Sure, the penalty is usually paid only once, on the first access to the vector, unless the compiler needs to spill the heap pointer.
The branch predicated on the test is executed speculatively, so the next load does not actually depend on the discriminant load and test; the dependency chain is therefore shorter (one load instead of two). If the code is bottlenecked by something other than dependency-chain length (for example, the number of outstanding loads), it of course doesn't matter, but dependency chains are usually the first bottleneck.
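To make the two chains concrete, here is a minimal sketch of a discriminant-based small vector (the layout and names are illustrative, not any particular library's implementation); the branch in `operator[]` is the one the predictor hides:

```cpp
#include <cstddef>

// Illustrative discriminant-based small vector (hypothetical layout).
template <typename T, std::size_t N>
struct SmallVec {
    std::size_t size_ = 0;
    bool on_heap_ = false;   // the discriminant
    union {
        T inline_[N];        // small case: elements live in the struct itself
        T* heap_;            // large case: pointer to a heap buffer
    };

    T& operator[](std::size_t i) {
        // The branch on on_heap_ is predicted, so the element load is
        // issued speculatively rather than waiting on the discriminant
        // load + test: one load on the chain in the small case, not two.
        return on_heap_ ? heap_[i] : inline_[i];
    }
};
```

With good prediction the small case touches only the struct itself; a mispredict, of course, pays the full branch penalty.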
Thanks. Mind you, that's the theory; I haven't actually benchmarked the use case. But in general, reducing the length of pointer chains to traverse is often a low-hanging fruit of performance optimization.
I've just noticed that while the branch-prediction argument may well often work out, the compiler nevertheless needs to emit additional object code for the discriminant test whenever it cannot infer whether an access is "small" or "large". That could mean a significant increase in code size. Of course, I didn't measure anything either.
It depends on the implementation. Using a discriminant plus branch is one way to do it, but another implementation is to just always use a data pointer, which points into the embedded array in the small vector case.
Now there is no branch, and in fact the element-access code is identical to the std::vector case. Only a few, less frequently called paths need to be aware of the small vs. large case, such as the resize code (which must not free the embedded buffer).
The main downsides are that you have added back the indirection in the small-vector case (the indirection is unconditional), and that you "waste" the 8 bytes for the data pointer in the small case, whereas the discriminant version can reuse that space for vector data.
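A sketch of that alternative, with hypothetical names: the data pointer is always valid (it initially points into the embedded array), so `operator[]` is branch-free, and only the slow paths consult `is_small()`:

```cpp
#include <cstddef>

// Illustrative pointer-based small vector (hypothetical layout).
template <typename T, std::size_t N>
struct SmallVecPtr {
    T* data_;              // always points at the current buffer
    std::size_t size_ = 0;
    std::size_t cap_ = N;
    T buf_[N];             // embedded storage, used while the vector is small

    SmallVecPtr() : data_(buf_) {}

    // Branch-free element access, identical to std::vector.
    T& operator[](std::size_t i) { return data_[i]; }

    // Only infrequent paths (growing, the destructor) need the
    // small/large distinction, e.g. to avoid freeing the embedded buffer.
    bool is_small() const { return data_ == buf_; }
};
```

The trade-off stated above is visible in the layout: the `data_` pointer occupies 8 bytes even in the small case, where a discriminant-based design could use that space for elements.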
IMO the observed performance improvements are basically 100% due to the avoidance of malloc and, in the real world (but not in this benchmark), due to the better cache friendliness of on-stack data, not because any vector operations such as element access are substantially sped up. That explains why the benefits disappear (in a relative sense) pretty quickly as the element size increases, even in the large-embedded-buffer case where all of the elements are still on the stack: the malloc benefit is (roughly) a one-time benefit per vector, not a "per access" or "per element" benefit, so as the vector grows, the relative benefit diminishes and the non-malloc operation costs come to dominate.
Since it's stack memory, in both cases you can assume that the struct (whether or not it includes a small array) is already loaded in the cache. The question is whether you can't also get the heap memory of the std::vector version into the cache if you care even a tiny bit about data organization.
Sure, it can be the case that both the data and the vector struct are in cache (in different lines). Analyses like "stack is in cache, but heap is in main memory" are too simple to capture the subtleties of real programs. In general, for a benchmark, you can expect everything to be in cache, unless the data is large enough that it doesn't fit, which usually means it won't fit regardless of the small-vector optimization.
That said, purely from a memory-access and cache-use point of view, it is more or less strictly better to pack all the vector data (data and metadata) together, as the small vector does: you'll always bring in only the one cache line containing everything. In the split stack + heap case [1], sure, both lines might be in cache, so the access time might be the same, but you still need to have both of those lines in the cache, so the footprint is bigger. So at best it can be "tied", but at some point you'll suffer misses with this strategy that you wouldn't suffer if everything were co-located. It follows directly from the general rule that you want to pack data that is accessed together close together.
---
[1] Of course it might be heap + heap since the std::vector itself might be allocated on the heap but it doesn't really change the analysis.