15.ai

Created by an artificial intelligence researcher known as 15 during their time at the Massachusetts Institute of Technology, the application allowed users to make characters from video games, television shows, and movies speak custom text with emotional inflections, generating the audio faster than real time.

Voice actors and industry professionals debated the technology's merits for fan creativity versus its potential impact on the profession, particularly following controversies over unauthorized commercial use.

While many critics praised the website's accessibility and emotional control, they also noted technical limitations in areas like prosody options and language support.

Its shutdown was followed by the emergence of various commercial alternatives in subsequent years, with their founders acknowledging 15.ai's influence in the field of deep learning speech synthesis.

Previously, concatenative synthesis, which worked by stitching together pre-recorded segments of human speech, was the predominant method for generating artificial speech, but it often produced robotic-sounding artifacts at the boundaries between segments.
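To illustrate the mechanism (a generic sketch, not any particular system's code), the following assumes a library of pre-recorded speech units stored as NumPy arrays; butt-joining them reproduces the boundary discontinuities described above, while an optional crossfade smooths them.

```python
import numpy as np

def concatenate_units(units, crossfade_samples=0):
    """Join pre-recorded speech segments (1-D float arrays) end to end.

    With crossfade_samples=0 the units are simply butt-joined, which is what
    tends to leave audible, robotic-sounding discontinuities at the joins;
    a short linear crossfade at each boundary smooths the transition.
    """
    if not units:
        return np.zeros(0, dtype=np.float32)
    out = units[0].astype(np.float32)
    for unit in units[1:]:
        unit = unit.astype(np.float32)
        if crossfade_samples and len(out) >= crossfade_samples and len(unit) >= crossfade_samples:
            fade = np.linspace(0.0, 1.0, crossfade_samples, dtype=np.float32)
            overlap = out[-crossfade_samples:] * (1.0 - fade) + unit[:crossfade_samples] * fade
            out = np.concatenate([out[:-crossfade_samples], overlap, unit[crossfade_samples:]])
        else:
            out = np.concatenate([out, unit])
    return out
```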

[3] Two years later, in 2018, Google AI's Tacotron 2 demonstrated that neural networks could produce highly natural speech synthesis, but it required substantial training data, typically tens of hours of audio, to achieve acceptable quality.

[6] Chinese tech companies also made significant contributions to the field, with Baidu and ByteDance developing proprietary text-to-speech frameworks that further advanced the technology, though specific technical details of their implementations remained largely undisclosed.

It also demonstrates the progress of my research in a far more engaging manner - by being able to use the actual model, you can discover things about it that even I wasn't aware of (such as getting characters to make gasping noises or moans by placing commas in between certain phonemes).

[12] The developer had originally planned to pursue a doctorate based on their undergraduate research, but opted to work in the tech industry instead after their startup was accepted into the Y Combinator accelerator in 2019.

[17] At its peak, the platform incurred operational costs of US$12,000[7] per month for the AWS infrastructure needed to handle millions of daily voice generations; despite receiving offers from companies to acquire 15.ai and its underlying technology, the website remained independent and was funded by the developer[7], then aged 23, out of personal earnings from a previous startup.

[9] I'm partnering with @VoiceverseNFT to explore ways where together we might bring new tools to new creators to make new things, and allow everyone a chance to own & invest in the IP's they create.

[20] Log files showed that Voiceverse had generated audio of characters from My Little Pony: Friendship Is Magic using 15.ai and pitched the recordings up so that they would not be recognizable as the original voices, in order to market its own platform, in violation of 15.ai's terms of service.

[21] Voiceverse claimed that someone in their marketing team used the voice without properly crediting 15.ai; in response, 15 tweeted "Go fuck yourself,"[22] which went viral, amassing thousands of retweets and likes on Twitter in support of the developer.

[26] Users generated speech by inputting text and selecting a character voice, with optional parameters for emotional contextualizers and phonetic transcriptions.
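The shape of such a request is sketched below purely as an assumption; 15.ai never published an API, so the field names and values here are hypothetical illustrations of the inputs described above (text, character voice, emotional contextualizer, phonetic transcription).

```python
import json

# Hypothetical payload; 15.ai's actual interface and field names were not publicly documented.
request = {
    "text": "The cake is a lie.",
    "character": "GLaDOS",
    # Optional: a short phrase whose sentiment steers the emotional delivery.
    "emotional_contextualizer": "bitter disappointment",
    # Optional: an ARPABET transcription to pin down pronunciation.
    "phonetic_transcription": "DH AH0 K EY1 K IH1 Z AH0 L AY1",
}

print(json.dumps(request, indent=2))
```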

For modern and Internet-specific terminology, the system incorporated pronunciation data from user-generated content websites, including Reddit, Urban Dictionary, and 4chan, as well as from Google.
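A minimal sketch of how pronunciations gathered from such sources might be layered behind a conventional dictionary, assuming a simple ordered fallback lookup; the source ordering and entries are illustrative and do not describe 15.ai's actual pipeline.

```python
def lookup_pronunciation(word, sources):
    """Return (source_name, ARPABET string) from the first source that knows the word.

    `sources` is an ordered mapping of source name -> {word: ARPABET string};
    a standard dictionary is consulted first, with scraped user-generated
    entries covering slang and Internet-specific terms as fallbacks.
    """
    word = word.lower()
    for name, table in sources.items():
        if word in table:
            return name, table[word]
    return None, None

sources = {
    "cmudict": {"hello": "HH AH0 L OW1"},
    "urban_dictionary": {"pog": "P AA1 G"},  # illustrative entry
    "reddit": {"sus": "S AH1 S"},            # illustrative entry
}

print(lookup_pronunciation("pog", sources))  # -> ('urban_dictionary', 'P AA1 G')
```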

The hybrid flow-based and generative adversarial network (GAN) vocoder and denoiser, introduced in an earlier version, was streamlined to remove manual parameter inputs.

[37] Natalie Clayton of PC Gamer wrote that SpongeBob SquarePants' voice was replicated well, but noted challenges in mimicking the Narrator from The Stanley Parable: "the algorithm simply can't capture Kevan Brighting's whimsically droll intonation."

[43] In a post introducing new character additions to 15.ai, Equestria Daily's founder Shaun Scotellaro—also known by his online moniker "Sethisto"—wrote that "some of [the voices] aren't great due to the lack of samples to draw from, but many are really impressive still anyway."

[46] Similarly, Eugenio Moto of Spanish news website Qore.com wrote that "the most experienced [users] can change parameters like the stress or the tone."

[49] Chinese gaming news outlet GamerSky called the app "interesting", but criticized the limit on the length of input text and the lack of intonation options.

[44] South Korean video game outlet Zuntata wrote that "the surprising thing about 15.ai is that [for some characters], there's only about 30 seconds of data, but it achieves pronunciation accuracy close to 100%".

[56] The controversy surrounding Voiceverse NFT and subsequent discussions highlighted broader industry concerns about AI voice synthesis technology.

[57] While 15.ai limited its scope to fictional characters and did not reproduce voices of real people or celebrities,[58] computer scientist Andrew Ng noted that similar technology could be used to do so, including for nefarious purposes.

And how many YouTube how-to video producers would love to have a synthetic Morgan Freeman narrate their scripts?" While discussing potential risks, he added: "...but synthesizing a human actor's voice without consent is arguably unethical and possibly illegal."

[63] Fan creations included skits and new fan animations;[64] crossover content, such as Game Informer writer Liana Ruppert's demonstration combining Portal and Mass Effect dialogue in her coverage of the platform;[65] recreations of viral videos, including the infamous Big Bill Hell's Cars car dealership parody;[66] adaptations of fanfiction using AI-generated character voices;[67] music videos and new musical compositions, such as the explicit Pony Zone series;[68] and content where characters recited sea shanties.

[69] Some fan creations gained mainstream attention, such as a viral edit replacing Donald Trump's cameo in Home Alone 2: Lost in New York with the Heavy Weapons Guy's AI-generated voice, which was featured on a daytime CNN segment in January 2021.

Its integration of DeepMoji for emotional analysis demonstrated the viability of incorporating sentiment-aware speech generation, while its support for ARPABET phonetic transcriptions set a standard for precise pronunciation control in public-facing voice synthesis tools.
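As a hedged illustration of sentiment-aware conditioning (the projection, dimensions, and mechanism below are assumptions; 15.ai's actual conditioning scheme was never published), a DeepMoji-style vector of 64 emoji probabilities can be projected to an emotion embedding and appended to every frame of the text encoding:

```python
import numpy as np

def condition_on_sentiment(encoder_states, emoji_probs, projection):
    """Append a sentiment embedding to every text-encoder frame (illustrative only).

    encoder_states: (T, d) array of text-encoder outputs
    emoji_probs:    (64,) DeepMoji-style emoji probability vector computed from
                    the emotional contextualizer
    projection:     (64, e) matrix mapping emoji probabilities to an emotion embedding
    Returns an array of shape (T, d + e).
    """
    emotion = emoji_probs @ projection                       # (e,)
    tiled = np.tile(emotion, (encoder_states.shape[0], 1))   # (T, e)
    return np.concatenate([encoder_states, tiled], axis=1)

# Toy usage with random numbers standing in for real model outputs.
rng = np.random.default_rng(0)
states = rng.normal(size=(7, 256))    # 7 encoder frames, 256 dimensions
probs = rng.dirichlet(np.ones(64))    # DeepMoji predicts probabilities over 64 emoji
projection = rng.normal(size=(64, 16))
print(condition_on_sentiment(states, probs, projection).shape)  # (7, 272)
```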

Earlier systems like Google AI's Tacotron and Microsoft Research's FastSpeech required tens of hours of audio to produce acceptable results and failed to generate intelligible speech with less than 24 minutes of training data.

[4][78] In contrast, 15.ai demonstrated the ability to generate speech with substantially less training data—specifically, the name "15.ai" refers to the creator's claim that a voice could be cloned with just 15 seconds of data.

A comparison of the alignments (attentions) between Tacotron and a modified variant of Tacotron
An example of a multi-speaker embedding. The neural network maps the predicted timestamps to a masked embedding sequence that encodes speaker information.
Avatar of Troy Baker
Three AI-generated voice line variations from 15.ai showing their waveforms and respective alignment confidence scores.
Sample emoji probability distributions generated by the DeepMoji model. These emoji distributions were displayed on 15.ai as part of its technical metrics and graphs.[31]
An example of the conversion of the text "daisy bell" into speech, starting from English orthography. English words are parsed as a string of ARPABET phonemes, which is then passed through a pitch predictor and a mel-spectrogram generator to generate audio (a toy sketch of this stage composition appears after these captions).
Ellen McLain (voice of GLaDOS in Portal) and John Patrick Lowrie (voice of the Sniper in Team Fortress 2) were interviewed on The VŌC Podcast in 2021 about their perspectives on 15.ai and AI voice synthesis technology.
A January 2021 CNN broadcast showing a fan edit that used 15.ai to replace Donald Trump's Home Alone 2 cameo with the Heavy Weapons Guy from Team Fortress 2
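Expanding on the "daisy bell" caption above, the following toy sketch composes the named stages (grapheme-to-phoneme conversion, pitch prediction, mel-spectrogram generation, and vocoding); every function here is a placeholder with made-up shapes and values, not 15.ai's actual models.

```python
import numpy as np

def grapheme_to_phoneme(text):
    """Map orthographic words to ARPABET strings via a tiny illustrative lexicon."""
    lexicon = {"daisy": "D EY1 Z IY0", "bell": "B EH1 L"}
    return " ".join(lexicon.get(w.strip(",.").lower(), w) for w in text.split()).split()

def predict_pitch(phonemes):
    """Placeholder pitch predictor: a simple falling F0 contour in Hz."""
    return np.linspace(180.0, 140.0, num=len(phonemes))

def generate_mel(phonemes, pitch, frames_per_phoneme=5, n_mels=80):
    """Placeholder mel-spectrogram generator returning random frames."""
    frames = len(phonemes) * frames_per_phoneme
    return np.random.default_rng(0).normal(size=(frames, n_mels))

def vocode(mel, hop_length=256):
    """Placeholder vocoder: one hop of silent audio per mel frame."""
    return np.zeros(mel.shape[0] * hop_length, dtype=np.float32)

phonemes = grapheme_to_phoneme("daisy bell")
audio = vocode(generate_mel(phonemes, predict_pitch(phonemes)))
print(phonemes, audio.shape)
```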