I can only give you some guesses, but I think there's some false sharing type of...

zozbot234 · on Nov 23, 2022

> I use AVX-512, and it's not even 2x faster, though it is faster--it should be more than 2x faster because AVX-512 has better instructions to work with. But when I combine this with doing the calculation in threaded parallel chunks on the array, it goes far slower than it should.

You might be saturating your memory bandwidth to the point where it just can't go any faster. Since it seems your problem is easy to parallelize, you might want to experiment with the rust-gpu ecosystem.
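
One rough way to check the bandwidth theory is to time a single pass over a buffer much larger than cache at several thread counts. The sketch below is only an illustration, not the original poster's code: `parallel_sum` and `measured_gb_per_s` are made-up names, the 128 MiB buffer is an assumed size, and a plain sum stands in for whatever the real kernel does. Each thread accumulates into its own local value, so the threads are not writing to a shared cache line.

```rust
use std::thread;
use std::time::Instant;

/// Sum `data` by splitting it into contiguous chunks, one per thread.
/// Each thread returns its own partial sum, so nothing is written to
/// shared memory while the chunks are being processed.
fn parallel_sum(data: &[f32], n_threads: usize) -> f32 {
    let n_threads = n_threads.max(1);
    // Ceiling division so every element lands in a chunk; at least 1
    // because `chunks(0)` would panic.
    let chunk_len = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f32>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<f32>()
    })
}

/// Effective read bandwidth in GB/s for one pass over `data`.
fn measured_gb_per_s(data: &[f32], n_threads: usize) -> f64 {
    let start = Instant::now();
    let total = parallel_sum(data, n_threads);
    let secs = start.elapsed().as_secs_f64();
    // Keep the compiler from discarding the work being timed.
    std::hint::black_box(total);
    (data.len() * std::mem::size_of::<f32>()) as f64 / secs / 1e9
}

fn main() {
    // 32 Mi floats = 128 MiB; assumed to be larger than the last-level cache.
    let data = vec![1.0f32; 1 << 25];
    for n in [1, 2, 4, 8] {
        println!("{n} threads: {:.1} GB/s", measured_gb_per_s(&data, n));
    }
}
```

If the GB/s figure stops growing after a few threads, the loop is probably limited by memory rather than by the ALUs, and wider vectors won't help much. Build with `--release`, otherwise the numbers mostly reflect debug overhead.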

kolbe · on Nov 23, 2022

I will say that when I use the same parallelization scheme with non-AVX operations, it scales properly and runs far faster than the AVX versions. One interesting caveat: when the compiler autovectorizes the non-intrinsic code, the problem persists.

justaguess2234 · on Nov 23, 2022

Isn't it because AVX-512 lowers your CPU's clock frequency?

kolbe · on Nov 23, 2022

Unclear, but the sources I've read say that's just an Intel issue.

https://www.phoronix.com/review/amd-zen4-avx512/6