I did not know the trick of using memcpy to get the compiler to load a register from an arbitrary char*. Given that the architecture supports unaligned loads, can you assume it will be optimized like that? I assume this is done to stay away from undefined behavior.
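For anyone else who hadn't seen it, here is a minimal sketch of the idiom as I understand it. The names load_u32 and unaligned_roundtrip_ok are made up for illustration. Casting the char* to uint32_t* and dereferencing it would risk alignment and strict-aliasing problems, whereas memcpy of a fixed size is well defined. Optimizing compilers commonly turn it into a single load when the target allows that, but as far as I know nothing in the C standard guarantees it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * address. The fixed-size memcpy is well defined for any alignment;
 * whether it becomes a single load instruction is up to the compiler
 * and target, so treat that as an expectation, not a guarantee. */
static uint32_t load_u32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical demo: store a value at an odd offset, then read it back
 * through load_u32. Returns 1 if the round trip preserved the value. */
static int unaligned_roundtrip_ok(void) {
    unsigned char buf[8] = {0};
    uint32_t x = 0xDEADBEEFu;
    memcpy(buf + 1, &x, sizeof x); /* deliberately misaligned */
    return load_u32(buf + 1) == x;
}
```

Compiling load_u32 with optimizations on and looking at the assembly is the easiest way to see what a given compiler and target actually do with it.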
Poking around Godbolt seems to indicate that ARM and x86 do it, but WebAssembly doesn't: https://www.godbolt.org/z/r86f9nr1q which I guess makes sense (but WebAssembly has to become actual machine code at some point, so maybe the optimization is lost entirely?).