It's my understanding that fp16 (already available on the previous-generation P100) and mixed precision (the major innovation of the V100's Tensor Cores) are different things, and that the Tensor Core speedup is entirely missing from this benchmark. Unlike the general-purpose P100, the TPU is a heavily optimized chip built for Deep Learning, hence its performance increase. However, the V100 is also heavily optimized for Deep Learning (its Tensor Cores are arguably the first non-GPU compute units from NVIDIA). I'm in no position to defend NVIDIA here haha, but if this is indeed the case, it seems like the benchmark misses the point.
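To illustrate the fp16 vs. mixed-precision distinction (this is my own toy sketch, not from the benchmark): pure fp16 loses small updates once an accumulator grows, which is why mixed-precision schemes like the V100's Tensor Cores accumulate in fp32 while computing in fp16. The snippet below simulates half precision with Python's stdlib `struct` ('e' format) rather than running on real hardware:

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float to the nearest IEEE-754 half-precision value
    return struct.unpack('e', struct.pack('e', x))[0]

update = 1e-4          # a small per-step update, 10,000 of them should sum to ~1.0
acc16 = 0.0            # pure fp16 accumulation
acc_master = 0.0       # stand-in for a higher-precision (fp32) master accumulator

for _ in range(10_000):
    # Pure fp16 loop: once acc16 is large enough, acc16 + update rounds
    # back to acc16 and the updates silently vanish.
    acc16 = to_fp16(acc16 + to_fp16(update))
    # Mixed-precision idea: keep the running sum in higher precision.
    acc_master += update

print(acc16)       # stalls far below 1.0
print(acc_master)  # ~1.0
```

The pure-fp16 accumulator gets stuck once its rounding step exceeds the update size, while the higher-precision accumulator reaches the expected total; that gap is the whole point of accumulating in fp32.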