One reason for the discrepancy between quoted numbers is that, if you only care about pushing the WER down and aren't particularly interested in a scalable system, you are free to run as many systems as you like, in as many configurations as you like, and then combine their outputs (ROVER etc.).
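For anyone unfamiliar with ROVER: the idea is to align the word sequences from several recognizers and take a vote at each position. A toy sketch of the voting step (the real ROVER first builds a word transition network via dynamic-programming alignment; here the hypotheses are assumed to be already aligned token-for-token):

```python
from collections import Counter

def rover_vote(hypotheses):
    """Toy ROVER-style combination: majority vote per word position.

    Assumes the hypotheses are pre-aligned and equal-length; real ROVER
    handles insertions/deletions via DP alignment before voting.
    """
    combined = []
    for words in zip(*(h.split() for h in hypotheses)):
        # Pick the word most systems agree on at this position.
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)

# Three systems disagree on one word; voting recovers the majority.
print(rover_vote([
    "the cat sat on the mat",
    "the cat sat in the mat",
    "the cat sat on the mat",
]))  # -> the cat sat on the mat
```

The point upthread stands: this kind of combination buys a lower number on the benchmark but multiplies the compute and complexity of the deployed system.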
Yeah. I don't get the impression that the Kaldi core team has been trying very hard recently to get SOTA on eval2000/Switchboard. This number comes from a single acoustic model with a trigram LM decode plus four-gram rescoring -- there isn't even a neural net language model in there. If I remember correctly, Microsoft's first "human parity" result used something like three acoustic models and at least four types of language models. This Kaldi model is competitive with the best single acoustic model Microsoft used.
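The decode-then-rescore setup mentioned above is worth spelling out: you decode with a cheap LM (trigram) to get a lattice or n-best list, then re-rank those candidates with a stronger LM (four-gram, or a neural LM). A minimal sketch of the n-best re-ranking idea, with made-up scores and a stand-in `strong_lm_score` function (this is the abstract technique, not Kaldi's actual lattice-rescoring code):

```python
def rescore_nbest(nbest, strong_lm_score, lm_weight=1.0):
    """Re-rank an n-best list with a stronger language model.

    nbest: list of (hypothesis, acoustic_score) pairs; higher is better.
    strong_lm_score: log-probability of a hypothesis under the new LM.
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * strong_lm_score(h[0]))[0]

# Toy example: the stronger LM prefers the second hypothesis enough
# to overturn the first-pass ranking.
nbest = [("the cat", -10.0), ("a cat", -10.5)]
strong_lm = lambda hyp: -1.0 if hyp == "a cat" else -3.0
print(rescore_nbest(nbest, strong_lm))  # -> a cat
```

In practice Kaldi rescores the whole lattice rather than a flat n-best list, which preserves far more candidates at the same cost, but the scoring arithmetic is the same.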
Fully agree. I think their work on training data augmentation (e.g., their ICASSP paper, http://danielpovey.com/files/2017_icassp_reverberation.pdf, or the ASPiRE model before) has a bigger impact on the practical usefulness of ASR than getting an X% relative improvement over the previous SOTA on the eval2000 set.
https://github.com/kaldi-asr/kaldi/blob/master/egs/fisher_sw...
Kaldi hasn't been in first place on that dataset recently, but it was a few years ago.
On other, more research-oriented datasets (e.g. distant-microphone speech or languages other than English), the best system is often based on Kaldi.