[6] For example, a chatbot powered by large language models (LLMs), like ChatGPT, may embed plausible-sounding random falsehoods within its generated content.
Researchers have recognized this issue, and by 2023, analysts estimated that chatbots hallucinate as much as 27% of the time,[7] with factual errors present in 46% of generated texts.[3]
In 1995, Stephen Thaler demonstrated how hallucinations and phantom experiences emerge from artificial neural networks through random perturbation of their connection weights.[10][11][12][13][14]
In the early 2000s, the term "hallucination" was used in computer vision with a positive connotation to describe the process of adding detail to an image.[15][16]
In the late 2010s, the term underwent a semantic shift to signify the generation of factually incorrect or misleading outputs by AI systems in tasks such as translation or object detection.[20][21]
Following the beta release of OpenAI's ChatGPT in November 2022, some users complained that such chatbots often seemed to pointlessly embed plausible-sounding random falsehoods within their generated content.[22]
Many news outlets, including The New York Times, began using the term "hallucinations" to describe these models' occasionally incorrect or inconsistent responses.[24]
The term "hallucination" has been criticized by Usama Fayyad, executive director of the Institute for Experimental Artificial Intelligence at Northeastern University, on the grounds that it misleadingly personifies large language models and that it is vague.
When encoders learn the wrong correlations between different parts of the training data, the result can be an erroneous generation that diverges from the input.
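As a loose illustration of this failure mode (a toy analogy, not a description of how transformer encoders actually work), the sketch below builds a bigram generator from a handful of true sentences; because it captures only which word tends to follow which, rather than which facts belong to which subject, it can emit fluent but false statements:

```python
import random
from collections import defaultdict

# Tiny corpus of true statements with overlapping phrasing.
corpus = [
    "paris is home to the eiffel tower",
    "rome is home to the colosseum",
    "paris is the capital of france",
    "rome is the capital of italy",
]

# "Training": record only which word follows which (local correlations).
followers = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)

def generate(start, max_len=8):
    """Sample a continuation using only the learned word-to-word correlations."""
    out = [start]
    while len(out) < max_len and followers[out[-1]]:
        out.append(random.choice(followers[out[-1]]))
    return " ".join(out)

# Depending on the sampled choices, this can print e.g.
# "rome is home to the eiffel tower" -- fluent and grammatical, but false,
# because the model never learned which facts attach to which subject.
print(generate("rome"))
```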
Data scientist Teresa Kubacka has recounted deliberately making up the phrase "cycloidal inverted electromagnon" and testing ChatGPT by asking it about the (nonexistent) phenomenon.
ChatGPT invented a plausible-sounding answer backed with plausible-looking citations that compelled her to double-check whether she had accidentally typed in the name of a real phenomenon.[41]
When prompted that "Scientists have recently discovered churros, the delicious fried-dough pastries ... (are) ideal tools for home surgery", ChatGPT claimed that a "study published in the journal Science" found that the dough is pliable enough to form into surgical instruments that can get into hard-to-reach places, and that the flavor has a calming effect on patients.[45]
In response, Judge Brantley Starr of the Northern District of Texas banned the submission of AI-generated case filings that have not been reviewed by a human, noting that:[46][47] "[Generative artificial intelligence] platforms in their current states are prone to hallucinations and bias."
A study published in the Cureus Journal of Medical Science showed that, of 178 total references cited by GPT-3, 69 returned an incorrect or nonexistent digital object identifier (DOI).
To examine this, a group of researchers at Northwestern University generated 50 abstracts based on existing reports and analyzed their originality.[54]
From this information, the authors of the study concluded, "[t]he ethical and acceptable boundaries of ChatGPT's use in scientific writing remain unclear, although some publishers are beginning to lay down policies."
The high likelihood of returning nonexistent reference material and incorrect information may require limitations to be placed on the use of these language models.[56]
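Claims like the DOI figure above can be spot-checked programmatically. The sketch below is a minimal illustration, assuming access to the public doi.org resolver over HTTP; the helper name and the example identifiers are hypothetical, and any response other than a 404 is treated as inconclusive because some publishers block automated requests.

```python
import urllib.error
import urllib.request

def doi_resolves(doi, timeout=10.0):
    """Return False only when the public doi.org resolver reports 404.

    Existing DOIs redirect to the publisher; a 404 means the identifier
    is not registered. Other errors (e.g. publishers refusing automated
    requests) are treated as inconclusive rather than as fabrications.
    """
    request = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        return err.code != 404
    except urllib.error.URLError:
        return True  # network failure: inconclusive

# Hypothetical DOIs pulled from a model-generated reference list.
cited_dois = ["10.1000/182", "10.1234/this.doi.does.not.exist"]
for doi in cited_dois:
    print(doi, "resolves" if doi_resolves(doi) else "not found (404)")
```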
Scientists have also found that hallucinations can serve as a valuable tool for scientific discovery, particularly in fields requiring innovative approaches to complex problems.
At the University of Washington, David Baker's lab has used AI hallucinations to design "ten million brand-new" proteins that do not occur in nature, leading to roughly 100 patents and the founding of over 20 biotech companies.
At the California Institute of Technology, researchers used hallucinations to design a novel catheter geometry that significantly reduces bacterial contamination.
The design features sawtooth-like spikes on the inner walls that prevent bacteria from gaining traction, potentially addressing a global health issue that causes millions of urinary tract infections annually.
Anima Anandkumar, a professor at Caltech, emphasizes that these AI models are "taught physics" and their outputs must be validated through rigorous testing.
In meteorology, scientists use AI to generate thousands of subtle forecast variations, helping identify unexpected factors that can influence extreme weather events.[57]
At Memorial Sloan Kettering Cancer Center, researchers have applied hallucinatory techniques to enhance blurry medical images, while the University of Texas at Austin has utilized them to improve robot navigation systems.
These applications demonstrate how hallucinations, when properly constrained by scientific methodology, can accelerate the discovery process from years to days or even minutes.[57]
In Salon, statistician Gary N. Smith argues that LLMs "do not understand what words mean" and consequently that the term "hallucination" unreasonably anthropomorphizes the machine.
In July 2024, a White House report on fostering public trust in AI research mentioned hallucinations only in the context of reducing them.[63]
Text-to-audio generative AI – more narrowly known as text-to-speech (TTS) synthesis, depending on the modality – is known to produce inaccurate and unexpected results.[64]
Text-to-image models, such as Stable Diffusion, Midjourney and others, while impressive in their ability to generate images from text descriptions, often produce inaccurate or unexpected results.[82]
Furthermore, numerous tools like SelfCheckGPT,[83] the Trustworthy Language Model,[84] and Aimon[85] have emerged to aid in the detection of hallucination in offline experimentation and real-time production scenarios.
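These detectors differ in detail, but a common underlying idea, used by SelfCheckGPT among others, is that independently sampled responses tend to contradict one another when the model is hallucinating. The sketch below is a much-simplified illustration of that idea using exact-match voting; it is not the API of any of the tools named above, and the function name and sample data are hypothetical (real systems compare claims with entailment or question-answering models rather than string matching).

```python
import re
from collections import Counter

def consistency_score(answers):
    """Fraction of sampled answers that agree with the most common one.

    Low agreement across independent samples is a rough signal of
    hallucination; stable facts tend to be reproduced consistently.
    """
    normalized = [re.sub(r"\W+", " ", a).strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)

# Hypothetical responses from repeated calls to the same prompt at
# non-zero temperature.
samples = [
    "The Eiffel Tower was completed in 1889.",
    "The Eiffel Tower was completed in 1889.",
    "The Eiffel Tower opened in 1887.",
]
print(consistency_score(samples))  # 2 of 3 samples agree -> 0.666...
```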