Note that this isn't even a case where a single sqrt() call is non-reproducible; it only affects sequences of inlined sqrt() calls within an unrolled loop. And I agree with you.
Some broader context is probably warranted though. This originated out of a discussion with the authors of P3375 [0] about the actual performance costs of reproducibility. I suspected that existing compilers could already do it, with no language change and no runtime cost, using inline assembly magic. This library was the experiment to see if I was full of it.
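For illustration, the core of the trick looks something like this (a minimal sketch of the general technique, not the library's actual API; fp_fence is a name I made up):

    #include <cstdio>

    // Sketch of the inline-asm "optimization fence" idea: an empty asm
    // statement with a read-write register constraint makes the value
    // opaque to the optimizer, so GCC/Clang can't contract, reassociate,
    // or constant-fold across it. No instructions are actually emitted.
    static inline double fp_fence(double v) {
    #if defined(__x86_64__)
        asm("" : "+x"(v));   // "+x": keep v in an SSE register
    #elif defined(__aarch64__)
        asm("" : "+w"(v));   // "+w": keep v in an FP/SIMD register
    #endif
        return v;
    }

    int main() {
        double a = 0.1, b = 0.2, c = 0.3;
        // Fencing the product stops -ffp-contract=on from fusing it with
        // the add into a differently rounded fma.
        std::printf("%.17g\n", fp_fence(a * b) + c);
    }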
The experiment found only a few minor limitations. One was this issue, which happens "outside" what the library is attempting to do (though it's potentially still fixable, as your godbolt link demonstrated). Another was that Clang and GCC have slightly different interpretations of "creative" register constraints: Clang's interpretation is closer to GCC's documentation than GCC's own, but produces worse code.
Otherwise, this gives you reproducibility right up to the deeper compiler limitations like NaN propagation, at essentially no performance cost. Across all 3 major compilers, and even the minor ones I tried, I wasn't able to find any "real" cases where it's not reproducible, only incredibly specific situations like this one.

[0] https://isocpp.org/files/papers/P3375R2.html
> but only sequences of inlined sqrt calls within an unrolled loop
Somewhat relatedly, that's also a problem with vectorized math libraries, affecting both gcc and clang: the vectorized function can give a different result than the scalar standard-libm one. Indeed, gcc wants at least "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only" to even allow using a vector math library, even though it takes explicit flags to enable one (clang's fine with just "-fno-math-errno").
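The shape of the problem, as a toy example of mine (the glibc libmvec symbol name and the clang -fveclib=libmvec flag are my assumptions about the setup, alongside the gcc flags quoted above):

    #include <cmath>

    // If this loop is auto-vectorized against a vector math library
    // (e.g. glibc's libmvec, whose AVX2 routine is _ZGVdN4v_sin), the
    // vector body may round differently than the scalar std::sin used
    // for the remainder iterations, so results can differ even within
    // a single array.
    void apply_sin(double* out, const double* in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = std::sin(in[i]);
    }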
For what it's worth, I believe clang has __arithmetic_fence for doing the exact thing you're using inline asm for; and the clang/llvm instruction-level constrained arithmetic I noted would be the sane way to achieve this.
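Usage is basically the same shape as the asm fence (a minimal sketch, assuming the builtin's documented semantics; clang-only):

    // __arithmetic_fence evaluates its argument and blocks value-changing
    // optimizations (e.g. -ffast-math reassociation) from moving across
    // the result, so a+b is rounded before c is added.
    double sum3(double a, double b, double c) {
        return __arithmetic_fence(a + b) + c;
    }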
The code sample shown in P3375 should be just a consequence of fma contraction on gcc/clang, I believe? i.e. -ffp-contract=off makes the results consistent for both gcc and clang. I do think it's somewhat funky that -ffp-contract=on is the default, but oh well: the spec allows it, it's a perf boost (and a precision boost... if one isn't expecting the less precise result), and one can easily opt out.
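The classic shape of contraction-induced divergence (my example, not the paper's):

    // With -ffp-contract=on (the default), gcc/clang may compile this to
    // fma(a, a, -(b*b)): the a*a product then isn't rounded, so for
    // a == b the result can be a tiny nonzero residual instead of
    // exactly 0. -ffp-contract=off forces both products to round.
    double sq_diff(double a, double b) {
        return a * a - b * b;
    }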
Outside of -ffast-math and -ffp-contract=on (and pre-SSE x86-32, i.e. ≥26-year-old CPUs, where doing things properly is a massive slowdown), I don't think clang and gcc should ever be doing any optimizations that change numerical values (besides NaN bit patterns).
Just optimization-fencing everything, while a nice and easy proof of concept, isn't something compiler vendors would just accept as the solution to implement; that's a ~tripling of IR instructions for each fp operation, which would probably turn into quite a compilation-speed slowdown, besides also breaking a good number of correct optimizations. (Though, again, this shouldn't even be necessary.)
And -ffast-math should be left alone, besides perhaps adding support for disabling it at a given scope/function. I can definitely imagine that, were some future architecture to add a division instruction that can have 1 ULP of error and is faster than regular division, compilers would absolutely use it for all divisions under -ffast-math, and you couldn't work around that with just optimization fences.
Wasn't aware of __arithmetic_fence, though there's an open bug ticket noting that it doesn't protect against contraction (https://github.com/llvm/llvm-project/issues/91674). Still worth trying, though. I was aware of GCC's __builtin_assoc_barrier, but it wasn't documented to prevent contraction when I last checked; it appears they've fixed that since. I hadn't considered the IR / compilation-speed issue. I'm aware of the broken optimizations, but they're not really a problem in practice as far as I can tell? You mainly lose some loop optimizations that weren't significant in the "serious" loop-heavy numerics code I tested against at work.
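For reference, the canonical use of the GCC builtin looks something like this (a sketch of mine, assuming the now-documented behavior):

    // Under -fassociative-math (part of -ffast-math), GCC may otherwise
    // simplify (t - sum) - y down to 0, destroying the compensation term
    // of Kahan summation; __builtin_assoc_barrier keeps t - sum intact.
    double kahan_error(double sum, double y) {
        double t = sum + y;
        return __builtin_assoc_barrier(t - sum) - y;  // rounding error of sum + y
    }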
P3375 is mainly about contraction, but there are other issues that can crop up. Intermediate promotion occasionally happens, and I've also seen cases of intermediate expressions optimized down to constants without rounding error. Autovectorization is also a problem for me, given the tendency of certain SIMD units to have FTZ set. I also have to deal with certain compilers that are less well-behaved than GCC and Clang in this respect.
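To make the FTZ hazard concrete, a toy x86 demonstration (my example; the specific values are just for illustration):

    #include <cfloat>
    #include <cstdio>
    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (x86 SSE)

    // With flush-to-zero set in MXCSR, operations whose results would be
    // subnormal return 0 instead, so the same arithmetic diverges from an
    // IEEE-compliant execution of identical source.
    int main() {
        volatile float x = FLT_MIN;                   // smallest normal float
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
        std::printf("%g\n", x * 0.5f);                // subnormal, ~5.88e-39
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        std::printf("%g\n", x * 0.5f);                // flushed to exactly 0
    }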
My concern isn't accuracy though. Compilers do that fine; no need to second-guess them. My hot take is that accuracy is relatively unimportant in most cases. Most code is written by people who have never read a numerical analysis book in their life, and built without full awareness of the compiler flags being used or what those flags mean for the program. That largely works out because small errors are not usually detectable in high-level program behavior except as a consequence of non-reproducibility. I would much rather accept a small amount of rounding error than deal with reproducibility issues across all the hardware I work on.
I didn't really mean the loop thing as much of a problem for the goal of reproducibility (it's easy enough to just not explicitly request a vector math library).
aarch32 NEON does have an implicit FTZ, and, yeah, such things are annoying; though gcc and clang don't use NEON for float arithmetic without -ffast-math (https://godbolt.org/z/3b11dW559).
I do agree that getting consistent results would definitely make sense as the default.