The technique, outlined in a paper in September 2016,[1] is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech.
[2] WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.
[3] Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant.
[6] It consists of a large library of speech fragments, recorded from a single speaker, that are then concatenated to produce complete words and sounds.
[11] According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio,[12] the network was fed real waveforms of speech in English and Mandarin.
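As an illustration of what feeding raw waveforms to a network can involve, the WaveNet paper reports compressing each audio sample with μ-law companding and quantising it to 256 levels, so that the network predicts one of 256 classes per sample. The following is a minimal Python sketch of that preprocessing step only, not DeepMind's implementation; the function names `mu_law_encode` and `mu_law_decode` are hypothetical, and samples are assumed to be floats in the range −1 to 1.

```python
import math

MU = 255  # assumed companding parameter; yields 256 quantisation levels


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    # Logarithmic compression: finer resolution for quiet samples.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Rescale from [-1, 1] to [0, mu] and round to the nearest class.
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(index: int, mu: int = MU) -> float:
    """Approximately invert mu_law_encode, returning a sample in [-1, 1]."""
    compressed = 2 * index / mu - 1
    # Inverse of the logarithmic compression above.
    expanded = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(expanded, compressed)
```

Decoding an encoded sample recovers the original value only approximately, because quantisation discards information; the error is smallest for quiet samples, where the logarithmic scale is finest.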
The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet autoencoders[17] details a method that improves the automatic recognition of, and discrimination between, dynamic and static features for "content swapping", notably including swapping voices on existing audio recordings, in order to make it more reliable.
Another follow-up paper, Sample Efficient Adaptive Text-to-Speech,[18] dated September 2018 (latest revision January 2019), states that DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.
According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. They maintain that voice cloning for entertainment-industry purposes would be far less complex, and would use different methods, than cloning intended to fool forensic evidencing methods and electronic ID devices, so that natural voices and voices cloned for entertainment could still be easily told apart by technological analysis.