"Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash."
That's not true at all in the case of ethash, where the running time is dominated by waiting for memory read ops to complete, not waiting for ALU ops (hashing) to finish executing.
I have also written an Equihash miner where the workload is similar: running time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.
I reiterate: compared to CPUs, GPUs excel at data-dependent random reads. Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs still trump CPUs. That's because the atom size (the minimum number of bytes a GPU/CPU can read/write on the memory bus) is the same or better on GPU: 64 bytes on CPU (DDR4) versus 32/64 bytes on GPU (HBM2), and GPUs have significantly higher memory throughput, up to 1000 GB/s, while CPUs are still stuck around 200 GB/s per socket (AMD EPYC Rome, 8-channel DDR4-3200).
So in ethash or Equihash mining workloads, many data-dependent read ops across a multi-GB data structure (much larger than local caches) will be mostly bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU memory controller can handle and (2) the overall memory throughput. In the case of GPUs, (1) is not really a problem, so you end up being bottlenecked by overall memory throughput. That's why GPUs win.
As of 3-4 years ago, I remember Intel CPUs having a maximum of 10 outstanding memory operations, so (1) was the bottleneck. But things could have changed with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs, their lower memory throughput guarantees they will perform worse than GPUs on such workloads.
Correct, GPUs can indeed do a better job of hiding latency through massive parallelism.
My expertise might be outdated here, but the problem used to be that actually reaching that max bandwidth in the presence of divergent warps and uncoalesced reads was just impossible.
Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?
Divergent warps are still a huge problem (but SLIDE doesn't have this problem AFAIK).
Uncoalesced reads are not a problem severe enough to make GPUs underperform CPUs. Put another way, uncoalesced reads hurt performance roughly equally badly on both GPUs and CPUs.
The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block nonce n + 1). This algorithm involves a lot of data dependency, so a computer training these networks may actually be memory-latency bound (unless you are training a ton of neural networks at once and can hide the latency), which is extremely bad for GPUs.