I think you should consider all of the results: Tables 13-15 compare against humans across a wide range of datasets. Human WER varies from 3.5% to 22.2% (on clean speech); some datasets are much harder than others. Different people are also probably better at some accents than at others. On top of that, people aren't great spellers, especially when it comes to names and proper nouns. One example off the top of my head: Narendra Modi is in the WSJ dataset. I bet many people would spell that wrong if they had only heard it and never seen it written. Or even worse, Tchaikovsky.
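To make the spelling point concrete, here is a rough sketch of how WER is usually computed: word-level edit distance divided by the number of reference words. The function name `wer` and the example sentence are made up for illustration; they are not from the paper or from any dataset, and real scoring tools also apply text normalization that this skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript: one misspelled proper noun in a five-word sentence.
reference = "narendra modi spoke about tchaikovsky"
hypothesis = "narendra modi spoke about chaikovsky"
print(wer(reference, hypothesis))
```

A single misspelled name counts as a full word substitution, so in a short utterance one spelling slip is already a large error rate, even though the listener understood every word.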

For the Mandarin system, the human performance was obtained from people in our office, not random Turkers: 4% WER for a group of 5 humans vs. 3.7% for the system.

Perhaps you're raising the bar: human-level performance no longer means an average or median level, but the top 1% or better. I'm not sure that's fair.