[1][2][3][4] Initially developed to improve human life, the technology has practical applications such as generating audiobooks and assisting individuals who have lost their voices to medical conditions.
[5][6] Additionally, it has commercial uses, including the creation of personalized digital assistants, natural-sounding text-to-speech systems, and advanced speech translation services.
[3] This has led to cybersecurity concerns among the global public about the side effects of audio deepfakes, including their possible role in disseminating misinformation and disinformation on audio-based social media platforms.
[13] In early 2020, the same technique was used to impersonate a company director as part of an elaborate scheme that convinced a branch manager to transfer $35 million.
[17][18] In March 2023, the United States Federal Trade Commission issued a warning to consumers about the use of AI to fake the voice of a family member in distress asking for money.
[20] That same month, an audio deepfake of Slovak politician Michal Šimečka purported to capture him discussing ways to rig the upcoming election.
[21] During the campaign for the 2024 New Hampshire Democratic presidential primary, over 20,000 voters received robocalls in which an AI-generated imitation of President Joe Biden's voice urged them not to vote.
[22][23] The New Hampshire attorney general said this violated state election laws, and alleged involvement by Life Corporation and Lingo Telecom.
The first breakthrough in this regard was introduced by WaveNet,[34] a neural network for generating raw audio waveforms capable of emulating the characteristics of many different speakers.
Indeed, both methods modify the acoustic-spectral and style characteristics of the speech signal, but the imitation-based approach usually keeps the input and output text unaltered.
[8] However, the scalability of machine learning methods remains unproven, because of the excessive training and manual feature extraction they require, especially when many audio files must be processed.
Several metrics measure the accuracy of audio deepfake generation; the most widely used is the mean opinion score (MOS), the arithmetic average of listener ratings.
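The MOS calculation itself is straightforward. A minimal sketch, with made-up listener ratings on the conventional 1-5 scale:

```python
# Minimal sketch of a mean opinion score (MOS) calculation:
# listeners rate each audio clip on a 1-5 scale, and the MOS
# is the arithmetic mean of those ratings.
def mean_opinion_score(ratings):
    """Return the arithmetic mean of a list of 1-5 listener ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)

# Hypothetical ratings for one synthesized utterance.
ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(ratings))  # 4.0
```

In practice, MOS studies average over many listeners and utterances, and report confidence intervals alongside the mean.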
[54] The platform integrated sentiment analysis through DeepMoji for emotional expression and supported precise pronunciation control via ARPABET phonetic transcriptions.
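To illustrate the idea of pronunciation control via ARPABET (this is an illustrative sketch, not the platform's actual interface), a user-supplied transcription can override a default lexicon entry. The `LEXICON` dictionary and `to_arpabet` helper below are hypothetical; the transcriptions follow CMUdict conventions:

```python
# Illustrative sketch (not the platform's actual API): ARPABET
# phonetic transcriptions let a user override how a word is spoken.
LEXICON = {
    "hello": "HH AH0 L OW1",
    "read":  "R IY1 D",  # present tense; the past tense is "R EH1 D"
}

def to_arpabet(word, overrides=None):
    """Look up a word's ARPABET transcription, preferring user overrides."""
    overrides = overrides or {}
    return overrides.get(word.lower(), LEXICON.get(word.lower()))

# Force the past-tense pronunciation of "read".
print(to_arpabet("read", overrides={"read": "R EH1 D"}))  # R EH1 D
```

This kind of override matters for heteronyms like "read", where spelling alone does not determine pronunciation.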
It is also essential to consider accents, the pronunciation patterns closely associated with a particular individual, location, or nation.
For this reason, many researchers have suggested following a self-supervised learning approach,[59] which handles unlabeled data, works effectively in detection tasks, and improves model scalability while decreasing computational cost.
In addition, most of the effort focuses on detecting synthetic-based audio deepfakes; few studies analyze imitation-based ones, because of the intrinsic difficulty of their generation process.
[11] Over the years, there has been an increase in techniques for defending against the malicious actions that audio deepfakes could enable, such as identity theft and the manipulation of speeches by national leaders.
To prevent deepfakes, some suggest using blockchain and other distributed ledger technologies (DLT) to identify the provenance of data and track information.
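The provenance idea can be sketched with a simple hash-chained ledger. This is a toy illustration of the principle, not any particular DLT product: each entry records a SHA-256 fingerprint of an audio file plus the hash of the previous entry, so altering any recorded clip or entry breaks the chain.

```python
import hashlib
import json

# Toy hash-chained ledger illustrating provenance tracking for audio.
def fingerprint(audio_bytes):
    """SHA-256 fingerprint of the raw audio bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()

def append_entry(ledger, audio_bytes, source):
    """Append a provenance record linked to the previous entry's hash."""
    prev = ledger[-1]["entry_hash"] if ledger else "0" * 64
    entry = {"source": source, "audio_sha256": fingerprint(audio_bytes), "prev": prev}
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    ledger.append(entry)
    return entry

def verify(ledger):
    """Check that every entry's prev pointer matches the preceding entry."""
    prev = "0" * 64
    for entry in ledger:
        if entry["prev"] != prev:
            return False
        prev = entry["entry_hash"]
    return True

ledger = []
append_entry(ledger, b"original waveform bytes", source="studio-recorder")
append_entry(ledger, b"edited waveform bytes", source="editing-suite")
print(verify(ledger))  # True
```

A real DLT deployment would distribute the ledger across many nodes and use consensus rather than a single list, but the tamper-evidence mechanism is the same.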
[29] That way, those who create the generation models, perhaps for nefarious purposes, would not know precisely what features facilitate the detection of a deepfake,[29] discouraging possible attackers.
[74] DEEP-VOICE[75] is a publicly available dataset intended for research purposes to develop systems to detect when speech has been generated with neural networks through a process called Retrieval-based Voice Conversion (RVC).
Preliminary research showed numerous statistically significant differences between features found in human speech and those found in speech generated by artificial intelligence algorithms.
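The kind of comparison involved can be sketched with a two-sample test on a single acoustic feature. The feature values below are made-up illustrative numbers, not results from the dataset; the sketch computes Welch's t-statistic, one common way to quantify such a difference:

```python
import math
import statistics

# Welch's t-statistic between one acoustic feature (e.g., spectral
# centroid in Hz) sampled from human speech and from generated speech.
# The sample values are invented for illustration only.
def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / na + vb / nb)

human     = [2150.0, 2210.0, 2180.0, 2165.0, 2195.0]
generated = [2310.0, 2290.0, 2335.0, 2305.0, 2320.0]
print(round(welch_t(human, generated), 2))  # -10.15
```

A large-magnitude t-statistic like this (compared against the t-distribution with the appropriate degrees of freedom) is what "statistically significant difference" refers to; actual studies apply such tests across many features at once, with corrections for multiple comparisons.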