Other existing approaches frequently use smaller, more closely paired audio–text training datasets,[^reference-1][^reference-2][^reference-3] or use broad but unsupervised audio pretraining.[^reference-4][^reference-5][^reference-6] Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
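As a rough illustration of what zero-shot evaluation across datasets looks like in practice, the sketch below computes word error rate (WER) on several test sets with no per-dataset fine-tuning. The audio paths and reference transcripts are hypothetical placeholders standing in for real benchmark splits; `jiwer` is a common open-source WER library.

```python
import whisper
import jiwer

model = whisper.load_model("base")  # one model, no per-dataset fine-tuning

# Hypothetical (audio path, reference transcript) pairs; in practice
# these would be full benchmark test splits.
test_sets = {
    "librispeech_test_clean": [("sample1.flac", "hello world")],
    "common_voice":           [("sample2.mp3", "good morning")],
}

for name, pairs in test_sets.items():
    hypotheses = [model.transcribe(path)["text"] for path, _ in pairs]
    references = [ref for _, ref in pairs]
    # Word error rate over the whole split, computed zero-shot.
    print(f"{name}: WER = {jiwer.wer(references, hypotheses):.3f}")
```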
About a third of Whisper’s audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation: zero-shot, Whisper outperforms the supervised SOTA on CoVoST2-to-English translation.
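The open-source `whisper` Python package exposes these two task modes directly via the `task` option, as the minimal sketch below shows: the same non-English clip is first transcribed in its original language and then translated into English. The audio filename is a placeholder.

```python
import whisper

model = whisper.load_model("small")

# Transcribe in the original language (auto-detected when not specified).
transcription = model.transcribe("french_clip.mp3", task="transcribe")
print(transcription["language"], transcription["text"])

# Same audio, but translated into English instead.
translation = model.transcribe("french_clip.mp3", task="translate")
print(translation["text"])
```

A single model handles both behaviors because the requested task is signaled to the decoder as part of its input, rather than by swapping in a separate translation model.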