I'd really like to see a "pull out all the stops" benchmark using highly optimised Asm for the two architectures. Then it's just a matter of how much you can squeeze out of the CPU itself, rather than being limited by the thick layers of language abstraction on top of it. That would be a nice theoretical maximum to compare against.
Edit: I tested the C++ version on my 5-year-old i7 with an even older compiler (I just had to modify the code to not use C++11 features). At the max optimisation level it produces a result of 1465 ms, which is pretty damn amazing considering that this is a 16-year-old compiler generating 32-bit code, and the most recent CPU it had knowledge of was the Pentium Pro (P6)! I'm convinced that an Asm version could be <1 s though, so there's still plenty of room for improvement.
You have to give Intel/AMD a lot of credit for improving the performance of existing binaries. They've reduced CPI and improved ILP a lot in 16 years, not to mention vastly improving various forms of prediction, caching and speculation. A lot of the performance improvement gleaned from modern compilers will come from doing a better job of boiling away the abstractions of the C++ language itself, e.g. smarter inlining, LTO, devirtualisation etc.