Whisper (speech recognition system)

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.[1]

OpenAI claims that the combination of different training data used in its development has led to improved recognition of accents, background noise and jargon compared to previous approaches.[3]

Whisper is a weakly supervised deep learning acoustic model, built on an encoder-decoder transformer architecture.[5]
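The encoder-decoder layout can be illustrated with a minimal PyTorch sketch. The class name, dimensions, and layer counts below are hypothetical, and positional embeddings are omitted for brevity; this is a sketch of the general architecture, not OpenAI's implementation.

```python
import torch
import torch.nn as nn

class TinySpeechTransformer(nn.Module):
    """Illustrative encoder-decoder transformer for speech-to-text.

    Dimensions and layer counts are hypothetical; positional embeddings
    are omitted for brevity.
    """

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=2, vocab=1000):
        super().__init__()
        # Convolutional frontend downsamples the log-mel spectrogram in time.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.token_embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, mel, tokens):
        # mel: (batch, n_mels, frames); tokens: (batch, seq) of text-token ids
        audio = self.conv(mel).transpose(1, 2)   # (batch, frames', d_model)
        tgt = self.token_embed(tokens)           # (batch, seq, d_model)
        # Causal mask so each position attends only to earlier text tokens.
        mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(audio, tgt, tgt_mask=mask)
        return self.lm_head(out)                 # (batch, seq, vocab) logits

model = TinySpeechTransformer()
logits = model(torch.randn(1, 80, 300), torch.randint(0, 1000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1000])
```

The encoder processes the audio features once, while the decoder generates text tokens autoregressively, attending to the encoder output through cross-attention.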

Speech recognition has had a long history in research; the first approaches made use of statistical methods, such as dynamic time warping, and later hidden Markov models.
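As a brief illustration of that template-matching era, the following is a minimal dynamic time warping sketch in Python; the function name and toy signals are illustrative.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences.
    Early recognizers compared an utterance against stored word templates
    this way, tolerating differences in speaking rate."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# The same "word" spoken at different speeds: the warped alignment
# keeps the distance small despite the time stretching.
fast = np.sin(np.linspace(0, 3, 30))
slow = np.sin(np.linspace(0, 3, 60))
print(dtw_distance(fast, slow))
```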

Around the 2010s, deep neural network approaches became more common for speech recognition models, enabled by the availability of large datasets ("big data") and increased computational performance.[6]

Early approaches to deep learning in speech recognition included convolutional neural networks, which were limited by their inability to capture sequential data; this motivated the development of sequence-to-sequence (seq2seq) approaches based on recurrent neural networks, such as those using long short-term memory.[9]

According to a New York Times report, in 2021 OpenAI believed it had exhausted sources of higher-quality data to train its large language models and decided to complement scraped web text with transcriptions of YouTube videos and podcasts, and developed Whisper to solve this task.

The training dataset consists of 680,000 hours of labeled audio-transcript pairs sourced from the internet.[1] Special tokens are used to allow the decoder to perform multiple tasks, such as language identification, transcription, translation into English, and timestamp prediction.
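The multitask prompt format can be sketched as follows. The token strings follow the Whisper paper, but the helper function itself is a simplified illustration, not part of the openai-whisper API.

```python
# Simplified illustration of Whisper's multitask decoder prompt.
# Token strings follow the Whisper paper; build_prompt is hypothetical.
def build_prompt(language, task, timestamps):
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Transcribe English audio, no timestamps:
print(build_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']

# Translate French speech into English text, with timestamps:
print(build_prompt("fr", "translate", timestamps=True))
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>']
```

The decoder is conditioned on this prefix, so a single model can switch between tasks and languages without any change to its weights.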

It was trained using the AdamW optimizer with gradient norm clipping and a linear learning-rate decay with warmup, with a batch size of 256 segments.
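This optimization setup corresponds roughly to the following PyTorch sketch; the model, loss, peak learning rate, and step counts here are illustrative placeholders, not Whisper's published values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(80, 80)                 # stand-in for the real model
optimizer = AdamW(model.parameters(), lr=1e-3)  # peak LR is illustrative

warmup, total = 2048, 100_000                   # illustrative step counts

def warmup_then_linear_decay(step):
    # Linear warmup to the peak LR, then linear decay toward zero.
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay)

# Dummy batches standing in for 256-segment audio batches.
loader = [torch.randn(256, 80) for _ in range(10)]

for batch in loader:
    loss = model(batch).pow(2).mean()           # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient norm clipping, as reported for Whisper's training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```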

Figure: OpenAI Whisper architecture. A standard Transformer architecture, with the encoder on the left and the decoder on the right.