Yeah, it’s clear why you can’t have a single optimized C version.
However, can’t you have 5 different non-portable optimized C versions, just like you do with the assembly code?
SIMD intrinsics are generally portable across compilers and OSes, because their C API is defined by Intel, not by compiler or OS vendors. When I want software optimized for multiple targets such as SSE, AVX1 and AVX2, I sometimes do that in C++.
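Here is a minimal sketch of what "several non-portable optimized versions from one source" could look like. It assumes compile-time selection: the same file is built once per target with different flags (e.g. `-msse2` or `-mavx` on GCC/Clang, `/arch:AVX` on MSVC), and the preprocessor picks the intrinsics path. `sum_floats` is a hypothetical example function, and choosing between the builds at runtime is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick the widest instruction set the compiler was told to target.
// __AVX__ is defined by GCC/Clang with -mavx and by MSVC with /arch:AVX.
// MSVC does not define __SSE2__, so SSE2 is also inferred from x64 or /arch:SSE2.
#if defined(__AVX__)
#include <immintrin.h>
#define SUM_USE_AVX 1
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SUM_USE_SSE 1
#endif

// Hypothetical example kernel: sum of n floats, vectorized per build target.
float sum_floats(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::size_t i = 0;
    float total = 0.0f;
#if defined(SUM_USE_AVX)
    __m256 acc = _mm256_setzero_ps();            // 8 float lanes
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) total += v;            // horizontal reduction
#elif defined(SUM_USE_SSE)
    __m128 acc = _mm_setzero_ps();               // 4 float lanes
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    for (float v : lanes) total += v;            // horizontal reduction
#endif
    for (; i < n; ++i) total += p[i];            // scalar tail, or the whole array on non-x86
    return total;
}

float sum_floats(const std::vector<float>& v) {
    return sum_floats(v.data(), v.size());
}

// Reports which path this particular build compiled in.
const char* sum_floats_target() {
#if defined(SUM_USE_AVX)
    return "AVX";
#elif defined(SUM_USE_SSE)
    return "SSE2";
#else
    return "scalar";
#endif
}
```

Building this file once with `-msse2` and once with `-mavx` yields two object files from one source, much like keeping separate assembly files per target. One caveat: the vector paths add the floats in a different order than the scalar loop, so for general (non-integer) data the results of different builds can differ in the last bits.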