Nice. Your results would be more reproducible if you included the cuDNN version. Turing architecture support (and the optimizations specific to DNN training) is still relatively recent, and performance differs between versions [1]. Multiplying matrices is kinda tricky ;) [2]
[1] https://developer.nvidia.com/cudnn
[2] https://scholar.google.com/scholar?as_ylo=2019&q=nvidia+gemm
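
For what it's worth, here is a rough sketch of how the versions could be captured alongside the numbers. It assumes the benchmark ran through PyTorch, which is only my guess; the helper name `describe_cudnn` is made up for illustration, and other frameworks expose this information differently.

```python
import importlib


def describe_cudnn():
    """Collect version info worth quoting next to benchmark results.

    Assumes PyTorch is the framework in use (a guess, not something
    stated above). Every value stays None when it cannot be determined.
    """
    info = {"torch": None, "cuda": None, "cudnn": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed, so there is nothing to report.
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None on CPU-only builds.
    info["cuda"] = torch.version.cuda
    if torch.backends.cudnn.is_available():
        # cuDNN version as an integer encoding major/minor/patch.
        info["cudnn"] = torch.backends.cudnn.version()
    return info


print(describe_cudnn())
```

Pasting that dict next to the timings should be enough for someone else to match the setup.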