
No, that is definitely not the case; not all memory-bound workloads are of the same type.

Hash tables produce lots of data-dependent random accesses into DRAM, which are definitely not better on GPUs: warp divergence, bank conflicts, partial cache-line access inefficiencies, etc. Even CPUs struggle on this type of pointer-chasing workload due to cache inefficiencies. Open addressing schemes such as Robin Hood hashmaps are popular because they reduce the amount of pointer chasing.
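For illustration, here's a minimal Robin Hood open-addressing sketch in Python (a hypothetical toy, no resizing, not from any particular library): entries live in one flat array, so a lookup is a short linear scan from the home slot instead of a chain of dependent pointer dereferences through heap-allocated nodes.

```python
# Toy Robin Hood open-addressing map (illustrative sketch, no resizing).
# All entries live in one flat array, so probing is a sequential scan
# from the home slot rather than chasing pointers through linked nodes.
class RobinHoodMap:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # each slot: (key, value, probe_dist)

    def _home(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        entry = (key, value, 0)
        idx = self._home(key)
        while True:
            slot = self.slots[idx]
            if slot is None:
                self.slots[idx] = entry
                return
            if slot[0] == entry[0]:                   # key already present
                self.slots[idx] = (slot[0], entry[1], slot[2])
                return
            if slot[2] < entry[2]:                    # Robin Hood: swap with
                self.slots[idx], entry = entry, slot  # the "richer" occupant
            idx = (idx + 1) % len(self.slots)
            entry = (entry[0], entry[1], entry[2] + 1)

    def get(self, key):
        idx, dist = self._home(key), 0
        while True:
            slot = self.slots[idx]
            if slot is None or slot[2] < dist:  # key would have been here
                return None
            if slot[0] == key:
                return slot[1]
            idx = (idx + 1) % len(self.slots)
            dist += 1
```

The probes still miss in cache for a huge table, but each probe is an array index rather than a dereference of a freshly loaded pointer.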

Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash.



GPUs still have higher memory bandwidth than CPUs though. So if you miss in caches all the time, GPUs can still potentially come out on top (as long as you can keep the higher memory latency under control, which the massive effective hyper-threading of GPUs should be able to help with). That's the point of using GPUs for solving memory-hard problems for mining those "ASIC-resistant" coins.


"Your example is a false comparison, generating a hash is very different from pointer chasing the address generated by the hash."

That's not true at all in the case of ethash, where the running time is dominated by waiting for memory read ops to complete, not waiting for ALU ops (hashing) to finish executing.
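To make that concrete, here's a toy Python sketch of the access pattern (emphatically not real ethash; the dataset, round count, and hash are stand-ins): each round's read address comes out of the previous hash, so the loop is one long chain of data-dependent DRAM reads with only a cheap hash between them.

```python
import hashlib

def memory_hard_mix(seed, dataset, rounds=64):
    """Toy sketch of an ethash-style loop (NOT the real algorithm):
    each round hashes the running state to pick the next read address,
    so every dataset access depends on the previous one, and the ALU
    work (one hash) is tiny next to a cache-missing read from a
    multi-GB dataset."""
    state = hashlib.sha256(seed).digest()
    for _ in range(rounds):
        idx = int.from_bytes(state[:8], "little") % len(dataset)
        word = dataset[idx]  # data-dependent read into the big dataset
        state = hashlib.sha256(state + word.to_bytes(8, "little")).digest()
    return state
```

With the real multi-GB DAG, that indexed read is almost always a DRAM miss, which is why the hash itself never shows up as the bottleneck.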

I have also written an Equihash miner where the workload is similar: running time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.

I reiterate: GPUs excel at data-dependent random reads, compared to CPUs. Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs still trump CPUs. That's because the atom size (minimum number of bytes that a GPU/CPU can read/write on the memory bus) is the same or better on GPU: 64 bytes on CPU (DDR4), and 32/64 bytes on GPU (HBM2), and GPUs have a significantly higher memory throughput up to 1000 GB/s while CPUs are still stuck around 200 GB/s per socket (AMD EPYC Rome 8-channel DDR4-3200).

So in ethash or Equihash mining workloads, many data-dependent read ops across a multi-GB data structure (much larger than local caches) will be mostly bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU memory controller can handle and (2) the overall memory throughput. In the case of GPUs, (1) is not really a problem, so you end up being bottlenecked by overall memory throughput. That's why GPUs win.

As of 3-4 years ago I remember Intel CPUs having a maximum of 10 outstanding memory operations, so (1) was the bottleneck. But things could have changed with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs, their lower memory throughput guarantees they will perform worse than GPUs on such workloads.
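The dependency structure is easy to picture with a toy sketch (hypothetical Python, just to illustrate the principle): chasing a random permutation gives you exactly one outstanding miss at a time per chain, so the only way to use more of the memory controller's outstanding-request budget is to run many independent chains, which is exactly what a GPU's massive thread oversubscription provides.

```python
import random

def build_chain(n, seed=0):
    """Random single-cycle permutation: nxt[i] is the next hop's index."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt, start, steps):
    """Each load's address depends on the previous load's result: one
    dependency chain = at most one outstanding miss. Hardware can only
    overlap misses across *independent* chains (e.g. thousands of GPU
    threads each chasing their own chain)."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i
```

Running k chains concurrently is k-way memory-level parallelism, which is the knob that thread oversubscription turns up.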


Correct, GPUs can indeed do a better job at hiding latency through massive parallelism.

My expertise might be outdated here, but the problem used to be that actually reaching that max bandwidth through divergent warps and uncoalesced reads was practically impossible.

Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?


Divergent warps are still a huge problem (but SLIDE doesn't have this problem AFAIK).

Uncoalesced reads are not a problem severe enough to make GPUs underperform CPUs. Or, said another way, uncoalesced reads come with a roughly equally bad performance impact on both GPUs and CPUs.


The only reason GPUs can hide the latency is the massive parallelism in the problem space (computing the hash for nonce n doesn't block nonce n + 1). This algorithm involves a lot of data dependency, so a machine training these networks may actually be memory-latency bound (unless you are training a ton of neural networks and can hide the latency), which is extremely bad for GPUs.


Well, depends on the size of the hash table and the particular memory access patterns. Lookups into many small hash tables, or workloads in which most threads in a threadgroup all fetch the same entry, can be very efficient on GPUs. Sparse virtual texturing is often implemented with a hash table and works well on GPUs because the hash table involved has both of these properties.
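A toy model makes the cost difference visible (hypothetical Python; 64-byte memory transactions assumed, and real hardware details vary): count how many distinct lines the 32 addresses issued by one warp touch.

```python
LINE = 64  # assumed bytes per memory transaction (cache line / sector pair)

def transactions(addresses):
    """Rough cost of one warp-wide load: the number of distinct 64B
    lines touched. A broadcast or coalesced access costs 1 transaction;
    a fully scattered one costs up to 32."""
    return len({addr // LINE for addr in addresses})

WARP = 32
same_entry = [4096] * WARP                   # all threads hit one slot
strided    = [i * 4 for i in range(WARP)]    # contiguous 4-byte loads
scattered  = [i * 1024 for i in range(WARP)] # random-ish big-table slots
```

The small-table and shared-entry cases land in the `same_entry`/`strided` regime; big-table random lookups land in `scattered`, which is where the bandwidth advantage gets eaten.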

(I'm sure you know this, just wanted to clarify.)


Yes, a very good point. I am assuming the tables are quite large due to the workload. If it's large enough to give a benefit to large pages in reduced DTLB misses, it's likely too large for warp-local memory :)


"We pre-allocate 1000 2MB Hugepages and 10 1GB Hugepages which are found to be enough for both Delicious-200K and Amazon-670K datasets."

So ~12 GB total (1000 × 2 MB + 10 × 1 GB) for those datasets.



