Speech synthesis

There were also legends of the existence of "Brazen Heads", such as those involving Pope Silvester II (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the German-Danish scientist Christian Gottlieb Kratzenstein won the first prize in a competition announced by the Russian Imperial Academy of Sciences and Arts for models he built of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation: [aː], [eː], [iː], [oː] and [uː]).

From his work on the vocoder, Homer Dudley developed a keyboard-operated voice-synthesizer called The Voder (Voice Demonstrator), which he exhibited at the 1939 New York World's Fair.

In 1961, physicist John Larry Kelly Jr. and his colleague Louis Gerstman[9] used an IBM 704 computer to synthesize speech, one of the most prominent events in the history of Bell Labs.

Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews.

Arthur C. Clarke was so impressed by the demonstration that he used it in the climactic scene of the screenplay for his novel 2001: A Space Odyssey,[10] where the HAL 9000 computer sings the same song as astronaut Dave Bowman puts it to sleep.

During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences.

Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.

An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones.
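
As a rough sketch of what such an index and selection step might look like (a minimal illustration; the field names, cost weights, and structure below are invented, not any particular system's design):

```python
# Minimal sketch of a unit-selection index.  Each stored unit keeps the
# acoustic parameters mentioned above (pitch, duration, position in the
# syllable, neighboring phones); selection scores candidates against a
# requested target specification.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str              # e.g. "ae"
    f0: float               # fundamental frequency in Hz
    duration: float         # seconds
    syllable_position: int  # 0 = onset, 1 = nucleus, 2 = coda
    left_phone: str         # neighboring phone to the left
    right_phone: str        # neighboring phone to the right
    audio_offset: int       # where this unit's audio lives in the database

def target_cost(unit: Unit, target: Unit) -> float:
    """Weighted distance between a candidate unit and the target specification."""
    cost = 0.0
    cost += 1.0 * abs(unit.f0 - target.f0) / 100.0
    cost += 2.0 * abs(unit.duration - target.duration)
    cost += 0.5 * (unit.syllable_position != target.syllable_position)
    cost += 0.5 * (unit.left_phone != target.left_phone)
    cost += 0.5 * (unit.right_phone != target.right_phone)
    return cost

def select_unit(index: dict, target: Unit) -> Unit:
    """Pick the stored unit of the right phone class with the lowest target cost."""
    candidates = index.get(target.phone, [])
    if not candidates:
        raise KeyError(f"no units recorded for phone {target.phone!r}")
    return min(candidates, key=lambda u: target_cost(u, target))
```

A real unit-selection engine also adds a join cost between consecutive units and searches the whole utterance, typically with dynamic programming, rather than choosing each unit independently.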

DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform.
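
A common form of that light processing is a short cross-fade over the join; the NumPy sketch below (with an arbitrary fade length) illustrates the idea and is not taken from any specific system:

```python
import numpy as np

def crossfade_join(left: np.ndarray, right: np.ndarray, fade_samples: int = 80) -> np.ndarray:
    """Concatenate two recorded segments, blending a short overlap region so
    the waveform does not jump abruptly at the join point."""
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = 1.0 - fade_out
    overlap = left[-fade_samples:] * fade_out + right[:fade_samples] * fade_in
    return np.concatenate([left[:-fade_samples], overlap, right[fade_samples:]])
```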

At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA[34] or MBROLA.
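
PSOLA and MBROLA are pitch-synchronous and considerably more involved, but the core idea of recombining windowed frames at a new spacing can be shown with a plain overlap-add time-stretch; the sketch below is a simplified stand-in for duration modification, not an implementation of either published algorithm:

```python
import numpy as np

def ola_time_stretch(signal: np.ndarray, rate: float,
                     frame_len: int = 512, synthesis_hop: int = 128) -> np.ndarray:
    """Stretch (rate < 1) or compress (rate > 1) a signal's duration by
    overlap-adding Hann-windowed frames read at a different analysis hop.
    Pitch-synchronous methods additionally align frames on pitch periods so
    that F0 can be modified cleanly."""
    analysis_hop = max(1, int(round(synthesis_hop * rate)))
    window = np.hanning(frame_len)
    n_frames = max(1, (len(signal) - frame_len) // analysis_hop + 1)
    out = np.zeros((n_frames - 1) * synthesis_hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = signal[i * analysis_hop: i * analysis_hop + frame_len]
        if len(frame) < frame_len:
            frame = np.pad(frame, (0, frame_len - len(frame)))
        out[i * synthesis_hop: i * synthesis_hop + frame_len] += frame * window
        norm[i * synthesis_hop: i * synthesis_hop + frame_len] += window
    return out / np.maximum(norm, 1e-8)
```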

Articulatory synthesis consists of computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.

Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech.

The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

More recent synthesizers, developed by Jorge C. Lucero and colleagues, incorporate models of vocal fold biomechanics, glottal aerodynamics and acoustic wave propagation in the bronchi, trachea, nasal and oral cavities, and thus constitute full systems of physics-based speech simulation.
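
A heavily simplified way to see the tube/waveguide idea (a toy sketch, not Trillium's or Lucero's actual models) is the classic Kelly–Lochbaum lattice: the vocal tract is treated as a chain of cylindrical sections, and pressure waves scatter at each junction according to the ratio of neighboring cross-sectional areas. The area values, reflection constants, and excitation below are invented purely for illustration:

```python
import numpy as np

def kelly_lochbaum(areas, excitation, lip_reflection=-0.85, glottal_reflection=0.75):
    """Toy Kelly-Lochbaum tube model (pressure-wave convention): forward and
    backward travelling waves in a chain of tube sections, with scattering at
    each junction set by the neighboring cross-sectional areas and one sample
    of propagation delay per section."""
    n = len(areas)
    k = [(areas[i + 1] - areas[i]) / (areas[i + 1] + areas[i]) for i in range(n - 1)]
    fwd = np.zeros(n)   # right-going (toward the lips) wave in each section
    bwd = np.zeros(n)   # left-going (toward the glottis) wave in each section
    output = []
    for x in excitation:
        new_fwd = np.empty(n)
        new_bwd = np.empty(n)
        # Glottis end: inject the source plus a partial reflection of the
        # wave returning from the tract.
        new_fwd[0] = x + glottal_reflection * bwd[0]
        # Interior junctions: transmit and reflect between adjacent sections.
        for i in range(n - 1):
            new_fwd[i + 1] = (1 - k[i]) * fwd[i] + k[i] * bwd[i + 1]
            new_bwd[i] = -k[i] * fwd[i] + (1 + k[i]) * bwd[i + 1]
        # Lip end: part of the wave radiates (the output), part reflects back.
        new_bwd[n - 1] = lip_reflection * fwd[n - 1]
        output.append((1 + lip_reflection) * fwd[n - 1])
        fwd, bwd = new_fwd, new_bwd
    return np.array(output)

# Crude, made-up area function driven by a simple impulse-train source.
areas = [0.6, 0.8, 1.0, 1.6, 2.4, 3.0, 3.4, 3.6]
excitation = np.zeros(4000)
excitation[::80] = 1.0
audio = kelly_lochbaum(areas, excitation)
```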

The deep neural networks are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.
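
As a minimal, purely illustrative sketch of that setup (not any particular published architecture; the layer sizes, names, and data pipeline are assumptions), a small PyTorch model can map a sequence of phoneme labels to acoustic frames and be trained by regression against features extracted from the recordings:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy neural acoustic model: phoneme IDs in, mel-spectrogram frames out.
    Real systems add duration modelling, alignment or attention, and a neural
    vocoder that turns the predicted frames into a waveform."""
    def __init__(self, n_phonemes: int = 64, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 128)
        self.rnn = nn.GRU(128, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(phoneme_ids)   # (batch, time, 128)
        x, _ = self.rnn(x)            # (batch, time, 512)
        return self.proj(x)           # (batch, time, n_mels)

model = TinyAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(phoneme_ids: torch.Tensor, mel_frames: torch.Tensor) -> float:
    """One supervised step: the labels/text provide phoneme_ids, the recorded
    speech provides the target mel_frames."""
    optimizer.zero_grad()
    loss = loss_fn(model(phoneme_ids), mel_frames)
    loss.backward()
    optimizer.step()
    return loss.item()
```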

It uses advanced algorithms to analyze the contextual aspects of text, aiming to detect emotions like anger, sadness, happiness, or alarm, which enables the system to understand the user's sentiment,[55] resulting in a more realistic and human-like inflection.

In 2023, VICE reporter Joseph Cox published findings that he had recorded five minutes of himself talking and then used a tool developed by ElevenLabs to create voice deepfakes that defeated a bank's voice-authentication system.

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language).

On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations.

Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and loanwords, whose pronunciations are not obvious from their spellings.
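
In practice the two approaches are often combined: look each word up in a pronunciation lexicon first and fall back to letter-to-sound rules only when the lookup fails. The sketch below illustrates that hybrid; its tiny lexicon and rules are invented and far too small for real use:

```python
# Toy grapheme-to-phoneme converter: dictionary lookup with a rule-based
# fallback.  The lexicon entries and letter-to-sound rules are illustrative
# only; real systems use large lexicons and hundreds of rules (or a trained
# statistical model).
LEXICON = {
    "of": ["AH", "V"],          # irregular: the 'f' is voiced here
    "one": ["W", "AH", "N"],    # pronunciation not predictable from spelling
}

# Naive letter-to-sound rules, applied left to right, longer patterns first.
RULES = [
    ("ch", ["CH"]), ("sh", ["SH"]), ("th", ["TH"]), ("ee", ["IY"]),
    ("a", ["AE"]), ("e", ["EH"]), ("i", ["IH"]), ("o", ["AA"]), ("u", ["AH"]),
    ("b", ["B"]), ("c", ["K"]), ("d", ["D"]), ("f", ["F"]), ("g", ["G"]),
    ("h", ["HH"]), ("j", ["JH"]), ("k", ["K"]), ("l", ["L"]), ("m", ["M"]),
    ("n", ["N"]), ("p", ["P"]), ("q", ["K"]), ("r", ["R"]), ("s", ["S"]),
    ("t", ["T"]), ("v", ["V"]), ("w", ["W"]), ("x", ["K", "S"]),
    ("y", ["Y"]), ("z", ["Z"]),
]

def letter_to_sound(word: str) -> list:
    """Rule-based fallback: greedily match rule patterns against the spelling."""
    phones, i = [], 0
    while i < len(word):
        for pattern, output in RULES:
            if word.startswith(pattern, i):
                phones.extend(output)
                i += len(pattern)
                break
        else:
            i += 1   # skip characters no rule covers
    return phones

def grapheme_to_phoneme(word: str) -> list:
    """Dictionary first, rules as a fallback for out-of-vocabulary words."""
    word = word.lower()
    return LEXICON.get(word) or letter_to_sound(word)
```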

A study in the journal Speech Communication by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling.

Since the Orator chip could also accept speech data from external memory, any additional words or phrases needed could be stored inside the cartridge itself.

The data consisted of strings of analog-filter coefficients to modify the behavior of the chip's synthetic vocal-tract model, rather than simple digitized samples.
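
The distinction between storing filter parameters and storing samples can be illustrated with a generic source-filter sketch: a simple excitation is passed through an all-pole filter whose coefficients stand in for the stored vocal-tract data. The resonance frequencies and bandwidths below are placeholders, not values from any actual chip:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                       # sample rate in Hz
excitation = np.zeros(fs // 4)  # a quarter second of source signal
excitation[::80] = 1.0          # 100 Hz impulse train as a crude voiced source

def resonator(freq_hz: float, bandwidth_hz: float, fs: int) -> np.ndarray:
    """Denominator of a second-order all-pole section with a resonance
    (formant-like peak) at freq_hz; the poles stay inside the unit circle."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    return np.array([1.0, -2 * r * np.cos(theta), r * r])

# Two placeholder resonances; a chip frame would instead supply predictor
# coefficients obtained from LPC analysis of recorded speech.
den = np.convolve(resonator(700, 110, fs), resonator(1200, 130, fs))
speech_like = lfilter([1.0], den, excitation)
```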

The AppleScript Standard Additions includes a say verb that allows a script to use any of the installed voices and to control the pitch, speaking rate and modulation of the spoken text.

Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.

The growing realism of synthetic voices also heightens concerns about disinformation. In March 2020, a freeware web application called 15.ai was released that generates high-quality voices of an assortment of fictional characters from a variety of media sources.

A noted application of speech synthesis was the Kurzweil Reading Machine for the Blind, which incorporated text-to-phonetics software based on work from Haskins Laboratories and a black-box synthesizer built by Votrax.

The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of characters from the Japanese anime series Code Geass: Lelouch of the Rebellion R2.

A voice quality synthesizer, developed by Jorge C. Lucero et al. at the University of Brasília, simulates the physics of phonation and includes models of vocal frequency jitter and tremor, airflow noise and laryngeal asymmetries.
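
A very reduced illustration of the perturbation side of such a model (a toy stand-in, not Lucero's synthesizer; every rate and amount below is invented) is a pulse-train voice source whose fundamental frequency carries cycle-to-cycle jitter, a slow tremor modulation, and additive aspiration-like noise:

```python
import numpy as np

def perturbed_source(f0=120.0, seconds=1.0, fs=16000,
                     jitter=0.01, tremor_hz=5.0, tremor_depth=0.02, noise=0.005):
    """Glottal-source-like pulse train with frequency jitter, slow tremor,
    and additive noise: a toy stand-in for the perturbation models described
    above."""
    rng = np.random.default_rng(0)
    n = int(seconds * fs)
    signal = np.zeros(n)
    t = 0.0
    while t < seconds:
        # Slow tremor modulates F0 sinusoidally; jitter perturbs each cycle.
        f = f0 * (1.0 + tremor_depth * np.sin(2 * np.pi * tremor_hz * t))
        f *= 1.0 + jitter * rng.standard_normal()
        signal[int(t * fs)] = 1.0          # one pulse per glottal cycle
        t += 1.0 / f
    return signal + noise * rng.standard_normal(n)
```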

Accompanying media for this article include an overview diagram of a typical TTS system; the computer and speech-synthesizer housing used by Stephen Hawking in 1999 (Hawking was one of the most famous people to use a speech computer to communicate); a DECtalk demo recording using the Perfect Paul and Uppity Ursula voices; the Fidelity Voice Chess Challenger (1979), the first talking chess computer, together with a sample of its speech output; a speech synthesis example using the HiFi-GAN neural vocoder; a speech synthesis kit produced by Bell System; a TI-99/4A speech demo using the built-in vocabulary; a demo of SAM on the C64; an Atari ST speech synthesis demo; MacinTalk 1 and MacinTalk 2 demos (the latter featuring the Mr. Hughes and Marvin voices); an example of speech synthesis with the Say utility included in Workbench 1.3; and the Votrax Type 'N Talk speech synthesizer (1980).