Speech processing

[1] Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple phonetic elements such as vowels.

In 1952, three researchers at Bell Labs, S. Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.

[2] Pioneering work in the field of speech recognition based on analysis of the speech spectrum was reported in the 1940s.

[3] Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.

[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.
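
To illustrate the idea behind LPC, the sketch below estimates predictor coefficients for a single speech frame using the autocorrelation method and the Levinson-Durbin recursion; the function name lpc_coefficients, the 10th-order model, and the toy frame are illustrative assumptions rather than details taken from the sources cited here.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients a[0..order] (a[0] = 1) for one windowed frame."""
    # Autocorrelation of the frame for lags 0..order.
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])

    # Levinson-Durbin recursion solves the Toeplitz normal equations efficiently.
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]                                  # prediction error of the 0th-order model
    for i in range(1, order + 1):
        # Mismatch between the current model and the next autocorrelation lag.
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / error                          # reflection (PARCOR) coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]       # update previous coefficients
        a[i] = k                                  # append the new coefficient
        error *= (1.0 - k * k)                    # updated prediction error
    return a, error

# Example: a 10th-order LPC model for one 25 ms frame of a toy 200 Hz tone.
fs = 8000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(len(t))
coeffs, err = lpc_coefficients(frame, order=10)
```

The resulting coefficients define an all-pole filter that predicts each sample from the previous ones, which is why LPC is widely used for compact spectral-envelope modelling and speech coding.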

[6] By the early 2000s, the dominant speech processing strategy started to shift away from Hidden Markov Models towards more modern neural networks and deep learning.

[7] In 2012, Geoffrey Hinton and his team at the University of Toronto demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large vocabulary continuous speech recognition tasks.

[10] Commercial speech systems built on these advances used deep learning models to provide more natural and accurate voice interactions.

The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition.

[11][8] In recent years, end-to-end speech recognition models, which map audio directly to text without separate acoustic, pronunciation, and language model components, have gained popularity.

[12] Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed.

In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules, such as requiring the alignment to be monotonic and to cover both sequences from start to end.
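
As a concrete illustration, the sketch below implements the standard dynamic-programming formulation of DTW for two one-dimensional sequences; the function name dtw_distance and the absolute-difference local cost are illustrative choices, not details fixed by the text.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW cost between two 1-D sequences using the full cost matrix."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance between samples
            # Allowed steps: match, insertion, deletion (monotonic, continuous path).
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Example: two renditions of the same shape at different speeds still align closely.
slow = np.sin(np.linspace(0, 2 * np.pi, 80))
fast = np.sin(np.linspace(0, 2 * np.pi, 50))
print(dtw_distance(slow, fast))
```

The optimal warping path itself can be recovered by backtracking through the cost matrix from the final cell.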

A hidden Markov model can be represented as the simplest dynamic Bayesian network.
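
To make the model concrete, the sketch below runs the forward algorithm, which computes the likelihood of an observation sequence under a discrete-observation HMM; the two-state toy parameters and the helper name forward are illustrative assumptions rather than values from the text.

```python
import numpy as np

def forward(pi, A, B, observations):
    """Forward algorithm: likelihood of an observation sequence under an HMM."""
    alpha = pi * B[:, observations[0]]            # initialize with the first observation
    for obs in observations[1:]:
        # Propagate state beliefs one step, then weight by the emission probability.
        alpha = (alpha @ A) * B[:, obs]
    return alpha.sum()

# Toy 2-state, 2-symbol model (all numbers illustrative).
pi = np.array([0.6, 0.4])                         # initial state distribution
A = np.array([[0.7, 0.3],                         # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                         # emission probabilities per state
              [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))               # P(observation sequence | model)
```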