Other existing approaches frequently use smaller, more closely paired audio-text training datasets,(^reference-1)(^reference-2)(^reference-3) or use broad but unsupervised audio pretraining.(^reference-4)(^reference-5)(^reference-6) Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
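As a rough illustration of what zero-shot evaluation looks like in practice, the minimal sketch below transcribes a clip with the open-source whisper package and scores it against a reference transcript using word error rate via the jiwer library; the model size, file name, and reference text are placeholders rather than part of any actual evaluation pipeline.

```python
# Minimal sketch of zero-shot evaluation: transcribe without any
# dataset-specific fine-tuning, then score with word error rate (WER).
# Requires: pip install openai-whisper jiwer
import whisper
from jiwer import wer

model = whisper.load_model("base")       # placeholder model size
result = model.transcribe("sample.wav")  # placeholder audio file

reference = "the quick brown fox jumps over the lazy dog"  # placeholder ground truth
hypothesis = result["text"].lower().strip()

print(f"Transcript: {hypothesis}")
print(f"WER: {wer(reference, hypothesis):.2%}")
```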
About a third of Whisper's audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation and outperforms the supervised SOTA on CoVoST2 to English translation in the zero-shot setting.
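For concreteness, here is a minimal sketch of the two tasks using the open-source whisper Python package: the same model is asked either to transcribe a clip in its original spoken language or to translate it into English. The model size and file name are placeholders.

```python
# Sketch of the two multitask modes on a non-English clip.
# Requires: pip install openai-whisper
import whisper

model = whisper.load_model("base")  # placeholder model size

# Transcribe in the original spoken language (language is auto-detected).
transcription = model.transcribe("french_clip.mp3", task="transcribe")
print("Original-language transcript:", transcription["text"])

# Translate the same audio into English with the same model.
translation = model.transcribe("french_clip.mp3", task="translate")
print("English translation:", translation["text"])
```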