It's mixed. You get something in the neighborhood of a 3-4x speedup with SHA-NI, but the algorithm is fundamentally serial. Fully parallel algorithms like BLAKE3 and K12, which can use wide vector extensions like AVX-512, can be substantially faster (10x+) even on one core. And multithreading compounds with that, if you have enough input to keep a lot of cores occupied. On the other hand, if you're limited to one thread and older/smaller vector extensions (SSE, NEON), hardware-accelerated SHA-256 can win. It can also win in the short input regime where parallelism isn't possible (< 4 KiB for BLAKE3).