AI’s Hallucinations Are Intensifying—and They’re Here to Stay

Errors Tend to Occur with AI-Generated Content


AI chatbots from tech giants such as OpenAI and Google have received a series of reasoning upgrades in recent months. Ideally, these upgrades would make their answers more reliable, but recent tests suggest the newest models may perform worse than their predecessors. Errors known as “hallucinations” have been a persistent problem that developers have struggled to eliminate.

Hallucination is a broad term describing certain kinds of errors made by large language models (LLMs), such as those powering OpenAI’s ChatGPT and Google’s Gemini. It primarily refers to instances where these models present false information as fact, but it can also describe cases where a generated answer is factually accurate yet irrelevant to the question asked.

A technical report from OpenAI evaluating its latest LLMs found that the o3 and o4-mini models, released in April, have significantly higher hallucination rates than the o1 model introduced in late 2024. When summarizing publicly available facts about people, o3 hallucinated 33% of the time and o4-mini 48% of the time, whereas o1 had a hallucination rate of only 16%.

This issue is not exclusive to OpenAI. One popular leaderboard tracking hallucination rates across models from different companies shows that some reasoning models, including DeepSeek-R1, which works through several reasoning steps before arriving at an answer, have higher hallucination rates than earlier models from their developers.

An OpenAI spokesperson said: “Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates we saw in o3 and o4-mini. We will continue our research on hallucinations across all models to improve accuracy and reliability.”

Hallucinations could significantly limit some potential applications of LLMs. A model that frequently produces misinformation is of little use as a research assistant, a bot that cites fictitious legal cases can get lawyers into trouble, and a customer service agent that asserts outdated or non-existent policies creates headaches for businesses.

Initially, AI companies believed they would resolve these issues over time. Historically, models had shown reduced hallucinations with each update, yet the recent spikes in hallucination rates complicate this narrative.

Vectara’s leaderboard ranks models by how consistently they summarize documents. It showed that, for systems from OpenAI and Google, “hallucination rates are almost the same for reasoning and non-reasoning models,” says Forrest Sheng Bao at Vectara. Google did not provide additional comment. For the leaderboard, the specific hallucination rates matter less than each model’s overall ranking, according to Bao.

However, these rankings may not be the best way to compare AI models. For one thing, they conflate different types of hallucinations. The Vectara team pointed out that although DeepSeek-R1 hallucinated 14.3% of the time, most of these hallucinations were “benign”: answers supported by logical reasoning or world knowledge that simply did not appear in the original text being summarized.

Another issue with such rankings is that tests based on text summarization “say nothing about the rate of incorrect output” when LLMs are used for other tasks, says Emily Bender at the University of Washington. She argues that leaderboard results may not be the best way to evaluate this technology, particularly because LLMs are not designed solely to summarize text.

These models work by repeatedly answering the question “what is a likely next word?” to formulate responses, so they do not process information in the usual sense of weighing what is true. Even so, many technology companies continue to use the term “hallucination” to describe output errors.
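As a rough illustration of that next-word process, here is a minimal sketch in Python. The toy vocabulary, probability table, and sample_next_word function are invented for this example and do not correspond to any real model’s internals or API; a real LLM computes its probability distribution over tens of thousands of tokens with a large neural network.

import random

# Toy probability table: in a real LLM these numbers come from a neural network,
# not a hand-written dictionary. The values here are illustrative only.
NEXT_WORD_PROBS = {
    "the capital of France is": {"Paris": 0.85, "Lyon": 0.10, "Atlantis": 0.05},
}

def sample_next_word(context: str) -> str:
    """Pick the next word at random, weighted by the model's probabilities."""
    dist = NEXT_WORD_PROBS.get(context, {"<end>": 1.0})
    words = list(dist)
    weights = [dist[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# The procedure never checks facts: "Atlantis" can be emitted simply because it
# has nonzero probability, which is one way a hallucination can arise.
print(sample_next_word("the capital of France is"))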

“The term ‘hallucination’ is doubly problematic,” says Bender. “On one hand, it implies that false output is abnormal and could potentially be mitigated, while on the other hand, it inaccurately anthropomorphizes the machine since large language models lack awareness.”

Arvind Narayanan from Princeton University argues that the issue extends beyond hallucinations. Models can also produce errors by utilizing unreliable sources or outdated information. Merely increasing training data and computational power may not rectify the problems.

We may have to accept the reality of error-prone AI, as Narayanan suggested in a recent social media post. In some cases, he argues, it may only make sense to use such models when fact-checking the AI’s answer is still faster than doing the work yourself. The best approach, though, might be to avoid relying on AI chatbots for factual information altogether.

Source: www.newscientist.com

Despite Advances in Technology, AI Hallucinations Are Intensifying

Last month, an AI bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It told them they were no longer allowed to use Cursor on more than one computer.

Customers complained in angry posts on internet message boards. Some canceled their Cursor accounts. Others grew angrier still when they realized what had happened: the AI bot had announced a policy change that did not exist.

“Such a policy does not exist. You are of course free to use Cursor on multiple machines,” Michael Truell, the company’s chief executive, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line AI support bot.”

Two years after the launch of ChatGPT, tech companies, office workers, and everyday users have increasingly turned to AI bots for a wide array of tasks. Yet there is still no reliable way to guarantee the accuracy of the information these systems provide.

The newest and most powerful technologies, so-called reasoning systems from companies like OpenAI, Google, and the Chinese startup DeepSeek, are producing more errors, not fewer. As their math skills have notably improved, their handle on facts has grown shakier. It is not entirely clear why.

Today’s AI bots are built on complex mathematical systems that learn their skills by analyzing enormous amounts of digital data. They cannot, and do not, decide what is true and what is false. Sometimes they simply make things up, a phenomenon some AI researchers call hallucination. In one assessment, the hallucination rate of the newest AI systems reached 79%.

These systems use mathematical probabilities to guess the best response rather than following a strict set of rules defined by human engineers, so a certain number of errors is inevitable. “Despite our best efforts, hallucination will always persist,” said Amr Awadallah, CEO of Vectara, a startup developing AI tools for enterprises, and a former Google executive. “It’s unavoidable.”

For years, this issue has raised doubts concerning the reliability of these systems. While they can be beneficial in specific contexts, such as drafting term papers, summarizing office documents, or coding, their inaccuracies pose significant challenges.

AI bots integrated with search engines like Google or Bing can generate laughable and erroneous search results. If you inquire about a popular marathon on the West Coast, they might point you to a race in Philadelphia. When asked for household statistics in Illinois, they could cite a source that doesn’t contain that information.

While these hallucinations may not significantly affect many users, they present serious concerns for those relying on technology for legal documents, medical data, or sensitive business information.

“We spend a lot of time trying to figure out which responses are factual and which aren’t,” said Pratik Verma, co-founder and CEO of Okahu, a firm that helps businesses navigate the hallucination problem. “Not dealing with these inaccuracies properly basically eliminates the value of an AI system, which is supposed to automate tasks for you.”

Cursor and Truell did not respond to requests for comment.

Over the past two years, firms such as OpenAI and Google have steadily improved their AI systems and reduced the frequency of these errors. But with the latest reasoning systems, mistakes are on the rise. According to its own evaluations, OpenAI’s newest systems hallucinate more often than their predecessors.

The company found that o3, its most powerful system, hallucinated 33% of the time on its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of the company’s previous reasoning system, called o1. The newly released o4-mini hallucinated at an even higher rate: 48%.

On another evaluation, SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51% and 79%, respectively, while the earlier system, o1, came in at 44%.

In a paper outlining the tests, OpenAI noted that further research is needed to understand the cause of these results. Because AI systems learn from more data than people can ever wrap their heads around, it is hard for technologists to determine why they behave the way they do.

“Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” said Gaby Raila, an OpenAI spokesperson. “We’ll continue our research on hallucinations across all models to improve accuracy and reliability.”

Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way to trace a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data, and because they can generate almost anything, this tool cannot explain everything. “We still do not fully understand how these models work,” she said.

Tests by independent organizations and researchers indicate that hallucination rates are also rising for reasoning models from companies including Google and DeepSeek.

Since late 2023, Vectara, Awadallah’s company, has been monitoring how frequently chatbots deviate from the truth. They assign these systems simple, verifiable tasks, such as summarizing particular news articles, yet chatbots continually fabricate information.

Initial surveys by Vectara estimated that, in this context, chatbots presented incorrect information at least 3% of the time and sometimes as high as 27%.

Over the next 18 months, companies such as OpenAI and Google pushed those figures down into the 1% to 2% range. Others, such as the San Francisco startup Anthropic, hovered around 4%. But hallucination rates on this test have risen as reasoning systems have advanced. DeepSeek’s reasoning model, R1, hallucinated 14.3% of the time, while OpenAI’s o3 reached 6.8%.
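As a rough sketch of how a summarization-based leaderboard like this might compute its headline figure, consider the Python below. The claim_is_supported judge and the toy data are hypothetical placeholders, not Vectara’s actual evaluation code; the point is only that the hallucination rate is the fraction of summaries flagged as containing content unsupported by the source article.

from typing import Callable, List, Tuple

def hallucination_rate(
    pairs: List[Tuple[str, str]],                    # (source article, model-written summary)
    claim_is_supported: Callable[[str, str], bool],  # hypothetical judge function
) -> float:
    """Return the fraction of summaries flagged as unsupported by their articles."""
    if not pairs:
        return 0.0
    flagged = sum(1 for article, summary in pairs if not claim_is_supported(article, summary))
    return flagged / len(pairs)

# Stand-in judge for demonstration: a summary passes only if every sentence
# appears verbatim in the article. Real evaluations use a trained model instead.
def naive_judge(article: str, summary: str) -> bool:
    return all(s.strip() in article for s in summary.split(".") if s.strip())

pairs = [
    ("The marathon was held in Seattle.", "The marathon was held in Seattle."),
    ("The marathon was held in Seattle.", "The marathon was held in Philadelphia."),
]
print(hallucination_rate(pairs, naive_judge))  # 0.5, i.e. a 50% rate on this toy data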

(The New York Times has filed a lawsuit against OpenAI and its partner Microsoft, claiming copyright infringement over news content related to AI systems. Both OpenAI and Microsoft have denied these allegations.)

For years, companies like OpenAI operated under the simplistic assumption that feeding more internet data into AI systems would enhance performance. However, they eventually exhausted nearly all online English text and required alternative methods to improve their chatbots.

Consequently, these companies are leaning more heavily on a technique scientists call reinforcement learning, in which a system learns through trial and error. The approach works well in certain domains, such as mathematics and computer programming, but falls short in other areas.
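For a sense of what “trial and error” means in its simplest form, here is a deliberately simplified Python sketch. The two named strategies and their success rates are made up for illustration; this is not how any production chatbot is trained, but it shows why the approach depends on rewards that are easy to verify automatically, as they are for math problems or passing code tests.

import random

# Hidden ground truth the learner does not know: how often each option succeeds.
TRUE_SUCCESS_RATE = {"strategy_a": 0.3, "strategy_b": 0.8}

estimates = {name: 0.0 for name in TRUE_SUCCESS_RATE}  # learner's running guesses
counts = {name: 0 for name in TRUE_SUCCESS_RATE}

for _ in range(1000):
    # Mostly pick the best-looking option, but explore a random one 10% of the time.
    if random.random() < 0.1:
        choice = random.choice(list(TRUE_SUCCESS_RATE))
    else:
        choice = max(estimates, key=estimates.get)

    # Reward is 1 for success, 0 for failure: easy to check automatically,
    # like a unit test passing or a math answer matching the key.
    reward = 1.0 if random.random() < TRUE_SUCCESS_RATE[choice] else 0.0

    counts[choice] += 1
    estimates[choice] += (reward - estimates[choice]) / counts[choice]  # running average

print(estimates)  # after enough trials, strategy_b's estimate is clearly higher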

“The way these systems are trained, they tend to focus on one task and start to forget about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is part of a team examining the hallucination problem in depth.

Another drawback is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step, and the errors can compound the longer they think.
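A back-of-the-envelope calculation shows why multi-step reasoning can amplify the problem. The 2% per-step error rate below is an invented illustrative number and the independence assumption is a simplification, but the arithmetic makes the compounding effect concrete.

# If each reasoning step goes wrong with probability p, and steps err independently,
# the chance of at least one error across n steps is 1 - (1 - p) ** n.
p = 0.02  # illustrative 2% per-step error rate
for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} steps -> {1 - (1 - p) ** n:.1%} chance of at least one error")
# 1 step -> 2.0%; 50 steps -> about 63.6%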

Because the latest bots reveal each step to users, users may also see each mistake. Researchers have also found that, in many cases, the steps a bot displays are unrelated to the answer it eventually delivers.

“What the system says it is ‘thinking’ is not necessarily what it is actually doing,” said Aryo Pradipta Gema, an AI researcher at the University of Edinburgh.

Source: www.nytimes.com