
Why would the struct-of-arrays layout be beneficial in a binary tree where every single node is visited, and the left + right arrays constitute all of the data in the tree? The previous example they gave was touching 75% less memory. In the binary tree it's the same amount of memory. Is the order it's touched better somehow?


Array-of-structs tends to be better only when there's high temporal locality, i.e. when nearly all of a struct's fields are used together.

This code, as written, makes left node accesses temporally correlated with other left node accesses (and by symmetry must do the same for right nodes), allowing the caches at all levels to have a higher hit rate. Whether that higher rate is actually attained depends on the architecture, compiler, other running code, linker, and all sorts of other garbage, but it has a lot higher chance of better behavior anyway.
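
To make the layout difference concrete, here is a minimal sketch (my own illustration, not the article's actual code) of the same child-pointer data in array-of-structs vs struct-of-arrays form, using u32 indices as handles and index 0 as a dummy "no child" sentinel. In the SoA layout, a walk that follows left children touches only the `left` array, so consecutive left-node accesses land near each other in memory:

```rust
// Array-of-structs: fields interleaved in one allocation.
#[allow(dead_code)]
struct NodeAos { left: u32, right: u32 }
#[allow(dead_code)]
struct TreeAos { nodes: Vec<NodeAos> }

// Struct-of-arrays: one array per field.
struct TreeSoa { left: Vec<u32>, right: Vec<u32> }

impl TreeSoa {
    fn new() -> Self {
        // Slot 0 is a dummy so index 0 can mean "no child".
        TreeSoa { left: vec![0], right: vec![0] }
    }
    fn add(&mut self, left: u32, right: u32) -> u32 {
        self.left.push(left);
        self.right.push(right);
        (self.left.len() - 1) as u32
    }
    // Count the nodes reachable from `node` (0 means "no child").
    fn count(&self, node: u32) -> u32 {
        if node == 0 { return 0; }
        1 + self.count(self.left[node as usize])
          + self.count(self.right[node as usize])
    }
}

fn main() {
    let mut t = TreeSoa::new();
    let l = t.add(0, 0);
    let r = t.add(0, 0);
    let root = t.add(l, r);
    assert_eq!(t.count(root), 3); // root plus two leaves
    println!("nodes = {}", t.count(root));
}
```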


For anyone interested, Rune's SOA is done entirely by DataDraw. The API for creating relations is passed almost verbatim to DataDraw, which generates a bunch of C code for interacting with the DD library. So Rune didn't really write any of this, it just bundled DD with a language runtime, and used a more permissive license despite DataDraw being LGPL... ! It was taking too long to set up Rune on an Ubuntu VM, so I just replicated Rune's version of the Benchmarks Game binary-tree benchmark using DataDraw directly.

DataDraw runs the depth=21 case (single threaded, -O3) in about 4.6 seconds on my M1. It has the built-in advantage of reusing the table allocations, since the database is global and implicit and the indexes are u32-sized. The existing C++g++#7/MemoryPool and Rust#5/bumpalo versions (both array-of-structs) discard their allocations on every iteration. Running Rust#5/bumpalo with RAYON_NUM_THREADS=1 takes about 2.7 seconds. So in addition to the implementation being someone else's and possibly misused LGPL, the claims re the binary tree benchmark are not really borne out.

Nevertheless you can very closely approximate in Rust what DataDraw does, without macros or anything else, by creating a single global struct of `slab::Slab`s, inserting a dummy first element, converting the usize keys to NonZeroU32 and using this as your SOA. If you do that, then you get roughly the same perf except that DataDraw isn't doing bounds checks. So the Rust version can do it in about 5.5 seconds vs 4.6. Obviously this can't be parallelised as you're passing a single &mut SOA around or using a thread-local.
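
A rough sketch of that shape (names and structure are mine; plain `Vec`s stand in for `slab::Slab`, which behaves the same for this append-only workload): one struct holding a growable column per field, a dummy element at index 0, and `Option<NonZeroU32>` as a 4-byte nullable handle.

```rust
use std::num::NonZeroU32;

// One column per field; slot 0 is a dummy so real keys are nonzero.
struct Soa {
    left: Vec<Option<NonZeroU32>>,
    right: Vec<Option<NonZeroU32>>,
}

impl Soa {
    fn new() -> Self {
        Soa { left: vec![None], right: vec![None] }
    }

    fn node(&mut self, left: Option<NonZeroU32>, right: Option<NonZeroU32>) -> NonZeroU32 {
        self.left.push(left);
        self.right.push(right);
        NonZeroU32::new((self.left.len() - 1) as u32).unwrap()
    }

    // Build a perfect tree of the given depth (depth 0 = no node).
    fn tree(&mut self, depth: u32) -> Option<NonZeroU32> {
        if depth == 0 { return None; }
        let l = self.tree(depth - 1);
        let r = self.tree(depth - 1);
        Some(self.node(l, r))
    }

    // Checksum-style walk; counts both nodes and null links,
    // so a depth-d tree yields 2^(d+1) - 1.
    fn check(&self, node: Option<NonZeroU32>) -> u32 {
        match node {
            None => 1,
            Some(n) => {
                let i = n.get() as usize;
                1 + self.check(self.left[i]) + self.check(self.right[i])
            }
        }
    }
}

fn main() {
    let mut soa = Soa::new();
    let root = soa.tree(4);
    assert_eq!(soa.check(root), 31);
    println!("check = {}", soa.check(root));
}
```

As noted above, everything funnels through one `&mut Soa`, which is what blocks the easy rayon-style parallelism the bumpalo version gets.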

I also compared the perf for a `slotmap::SlotMap`. It was about 7 seconds, but it solves the ABA problem with the generational indices, so that may be worth it.
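
For readers unfamiliar with the ABA issue: a plain index can be reused after its slot is freed, so a stale handle silently reads the new occupant. Generational keys catch this. Here is a minimal generational-index arena (my own illustration, not slotmap's actual implementation):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Key { index: u32, generation: u32 }

struct Arena<T> {
    slots: Vec<(u32, Option<T>)>, // (generation, value)
    free: Vec<u32>,               // indices available for reuse
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new(), free: Vec::new() }
    }

    fn insert(&mut self, value: T) -> Key {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.1 = Some(value);
            Key { index, generation: slot.0 }
        } else {
            self.slots.push((0, Some(value)));
            Key { index: self.slots.len() as u32 - 1, generation: 0 }
        }
    }

    fn remove(&mut self, key: Key) -> Option<T> {
        let slot = self.slots.get_mut(key.index as usize)?;
        if slot.0 != key.generation { return None; }
        slot.0 += 1; // bump generation: stale keys stop matching
        let value = slot.1.take();
        self.free.push(key.index);
        value
    }

    fn get(&self, key: Key) -> Option<&T> {
        let slot = self.slots.get(key.index as usize)?;
        if slot.0 == key.generation { slot.1.as_ref() } else { None }
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("first");
    arena.remove(a);
    let b = arena.insert("second"); // reuses slot 0, new generation
    assert_eq!(a.index, b.index);
    assert_eq!(arena.get(a), None); // stale key rejected
    assert_eq!(arena.get(b), Some(&"second"));
    println!("ok");
}
```

The extra generation check is part of the ~7 s vs ~5.5 s gap, and whether the ABA safety is worth it depends on whether keys can outlive their slots in your workload.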

Finally I compared the perf for a global Slab<Node>, i.e. global AOS style, to avoid clouding the AOS results with bumpalo's insane performance. It ran as fast as the global SOA style, in 4.6 seconds: the overhead of managing two arrays, their separate allocations, and their separate bounds checks outweighed whatever advantage the SOA memory access pattern offered, if it offered any at all.


Some memory-access orders are better than others, because CPUs prefetch nearby memory into cache as accesses happen, so touching adjacent addresses in sequence hits cache lines that are already loaded.



