One reason for the discrepancy between quoted numbers is that, if you only care about pushing the WER down and aren't particularly interested in a scalable system, you are free to run as many systems as you like, in as many configurations as you like, and then combine their outputs (ROVER etc.).
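For anyone unfamiliar with ROVER: the idea is to align the word sequences from several recognizers and take a vote at each position. A toy sketch of the voting step (the real ROVER first builds a word transition network via dynamic-programming alignment; here the hypotheses are assumed to be already aligned token-for-token):

```python
from collections import Counter

def rover_vote(hypotheses):
    """Toy ROVER-style combination: majority vote per word position.

    Assumes the hypotheses are pre-aligned and equal-length; real ROVER
    handles insertions/deletions via DP alignment before voting.
    """
    combined = []
    for words in zip(*(h.split() for h in hypotheses)):
        # Pick the word most systems agree on at this position.
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)

# Three systems disagree on one word; voting recovers the majority.
print(rover_vote([
    "the cat sat on the mat",
    "the cat sat in the mat",
    "the cat sat on the mat",
]))  # -> the cat sat on the mat
```

The point upthread stands: this kind of combination buys a lower number on the benchmark but multiplies the compute and complexity of the deployed system.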
Yeah. I don't get the impression that the Kaldi core team has been trying very hard recently to get SOTA on eval2000/Switchboard. This number comes from a single acoustic model with a trigram LM decode plus four-gram rescoring -- there isn't even a neural net language model in there. If I remember correctly, Microsoft's first "human parity" result used something like three acoustic models and at least four types of language models. This Kaldi model is competitive with the best single acoustic model Microsoft used.
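The decode-then-rescore setup mentioned above is worth spelling out: you decode with a cheap LM (trigram) to get a lattice or n-best list, then re-rank those candidates with a stronger LM (four-gram, or a neural LM). A minimal sketch of the n-best re-ranking idea, with made-up scores and a stand-in `strong_lm_score` function (this is the abstract technique, not Kaldi's actual lattice-rescoring code):

```python
def rescore_nbest(nbest, strong_lm_score, lm_weight=1.0):
    """Re-rank an n-best list with a stronger language model.

    nbest: list of (hypothesis, acoustic_score) pairs; higher is better.
    strong_lm_score: log-probability of a hypothesis under the new LM.
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * strong_lm_score(h[0]))[0]

# Toy example: the stronger LM prefers the second hypothesis enough
# to overturn the first-pass ranking.
nbest = [("the cat", -10.0), ("a cat", -10.5)]
strong_lm = lambda hyp: -1.0 if hyp == "a cat" else -3.0
print(rescore_nbest(nbest, strong_lm))  # -> a cat
```

In practice Kaldi rescores the whole lattice rather than a flat n-best list, which preserves far more candidates at the same cost, but the scoring arithmetic is the same.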
Fully agree. I think their work on training data augmentation (e.g., their ICASSP paper, http://danielpovey.com/files/2017_icassp_reverberation.pdf, or the ASPiRE model before) has a bigger impact on the practical usefulness of ASR than getting an X% relative improvement over the previous SOTA on the eval2000 set.
https://github.com/kaldi-asr/kaldi/blob/master/egs/fisher_sw...
Kaldi hasn't been in first place on that dataset recently, but it was a few years ago.
On other, more research-oriented datasets (e.g. distant-microphone speech or languages other than English), the best system is often based on Kaldi.