Introducing Whisper

Other existing approaches frequently use smaller, more closely paired audio-text training datasets,^{(^reference-1)}^{(^reference-2)}^{(^reference-3)} or use broad but unsupervised audio pre-training.^{(^reference-4)}^{(^reference-5)}^{(^reference-6)} Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets, we find that it is much more robust and makes 50% fewer errors than those models.
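To make a figure like "50% fewer errors" concrete, here is a minimal sketch of how such a comparison is commonly computed, assuming errors are counted as word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The function names and the toy sentences are hypothetical and chosen for illustration only; they are not taken from Whisper's evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new system avoids."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy transcripts, invented for this example.
    reference = "the cat sat on the mat"
    baseline_output = "the cat sit on mat"       # one substitution, one deletion
    robust_output = "the cat sat on a mat"       # one substitution

    baseline_wer = word_error_rate(reference, baseline_output)
    robust_wer = word_error_rate(reference, robust_output)
    print(f"baseline WER: {baseline_wer:.3f}")
    print(f"robust WER:   {robust_wer:.3f}")
    print(f"relative reduction: {relative_error_reduction(baseline_wer, robust_wer):.0%}")
```

In practice, transcripts are usually normalized (punctuation, casing, number formatting) before scoring, and the comparison is aggregated across many datasets rather than drawn from a single sentence.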

About a third of Whisper's audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to-English translation zero-shot.
