AI's Hallucinations Are Intensifying—and They're Here to Stay

Errors Tend to Occur with AI-Generated Content

Paul Taylor/Getty Images

AI chatbots from tech giants like OpenAI and Google have received several reasoning upgrades in recent months. Ideally, these upgrades would make their answers more reliable, but recent tests suggest the newer models sometimes perform worse than their predecessors. The errors, known as “hallucinations,” have been a persistent problem that developers have struggled to eliminate.

Hallucination is a broad term for certain kinds of errors made by the large language models (LLMs) that power systems such as OpenAI’s ChatGPT and Google’s Gemini. It mainly refers to cases where a model presents false information as fact, but it can also describe an answer that is accurate yet irrelevant to the question asked.

A technical report from OpenAI evaluating its latest LLMs showed that the o3 and o4-mini models, released in April, have significantly higher hallucination rates than the company’s earlier o1 model, introduced in late 2024. For instance, when summarizing publicly available facts about people, o3 hallucinated 33% of the time and o4-mini 48% of the time, whereas o1 had a rate of only 16%.

This issue is not exclusive to OpenAI. A popular leaderboard from the company Vectara, which assesses hallucination rates for models from different developers, indicates that some reasoning models, including DeepSeek-R1, hallucinate more often than earlier models from the same developers. A reasoning model works through several steps before reaching a conclusion.

An OpenAI spokesperson stated, “We are actively working to reduce hallucination rates in o3 and o4-mini. Hallucinations are not inherently more common in reasoning models. We will continue our research across all models to enhance accuracy and reliability.”

Hallucinations can seriously undermine some potential applications of LLMs. A model that frequently produces misinformation is unsuitable as a research assistant, and a bot that cites fictitious legal cases could get lawyers into trouble. A customer service agent that presents obsolete policies as current can also create significant problems for a business.

Initially, AI companies claimed these issues would be resolved over time. Historically, models did hallucinate less with each update, yet the recent spikes in hallucination rates complicate this narrative.

Vectara’s leaderboard ranks models by their factual consistency when summarizing documents. It indicates that, at least for systems from OpenAI and Google, “hallucination rates are roughly comparable for reasoning and non-reasoning models,” says Forrest Sheng Bao of Vectara. Google did not provide additional comment. For the purposes of the leaderboard, the specific hallucination rates matter less than each model’s overall ranking, according to Bao.
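To make the idea of ranking by hallucination rate concrete, here is a minimal sketch in Python. It is not Vectara's method: the model names, the judgments, and the helper functions `hallucination_rate` and `rank_models` are all invented for illustration, and it simply assumes each summary has already been judged consistent or inconsistent with its source document.

```python
def hallucination_rate(judgments):
    """Fraction of summaries judged inconsistent with their source document."""
    if not judgments:
        raise ValueError("need at least one judged summary")
    inconsistent = sum(1 for consistent in judgments if not consistent)
    return inconsistent / len(judgments)


def rank_models(results):
    """Return (model, rate) pairs sorted from lowest to highest rate."""
    rates = {model: hallucination_rate(j) for model, j in results.items()}
    return sorted(rates.items(), key=lambda pair: pair[1])


# Invented data: True means the summary was consistent with its source.
results = {
    "model_a": [True, True, True, False],
    "model_b": [True, False, False, True],
    "model_c": [True, True, True, True],
}

for position, (model, rate) in enumerate(rank_models(results), start=1):
    print(f"{position}. {model}: {rate:.0%} hallucination rate")
```

A ranking like this records only the order of the models. It says nothing about what kind of error each inconsistent summary contained.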

However, this kind of ranking may not be the best way to compare AI models. For one thing, it conflates different types of hallucinations. The Vectara team pointed out that although the DeepSeek-R1 model had a 14.3% hallucination rate, many of these hallucinations were “benign”: statements that can be logically deduced but do not appear in the original text being summarized.

Another issue with these rankings is that tests based on text summaries “reveal nothing about the percentage of incorrect output” for the other tasks LLMs are used for, says Emily Bender at the University of Washington. She suggests that leaderboard results do not provide a comprehensive evaluation of this technology, particularly since LLMs are not designed solely for text summarization.

These models build their responses by repeatedly answering the question “What is the next word?”, so they do not process information in the usual sense. Even so, many technology companies continue to use the term “hallucination” to describe output errors.
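The next-word loop can be illustrated with a toy sketch. This is not how a production LLM is built: it uses simple word-pair counts rather than a neural network, and the corpus and the function names `train_bigrams` and `generate` are invented for illustration. It only shows that a system choosing a likely next word never checks whether the result is true.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows each word in the training text."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def generate(following, start, max_words=8):
    """Repeatedly append the most frequent next word (greedy choice)."""
    output = [start]
    for _ in range(max_words - 1):
        candidates = following.get(output[-1])
        if not candidates:
            break  # no known continuation for the last word
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)


# Invented training text for the toy model.
corpus = "the court cited the case and the court cited the ruling"
model = train_bigrams(corpus)
print(generate(model, "court", max_words=5))
```

The generated phrase follows the word patterns of the training text, but nothing in the loop verifies that any court actually cited anything.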

“The term ‘hallucination’ is doubly problematic,” says Bender. “On one hand, it implies that false output is abnormal and could potentially be mitigated, while on the other hand, it inaccurately anthropomorphizes the machine since large language models lack awareness.”

Arvind Narayanan of Princeton University argues that the issue extends beyond hallucinations. Models can also make other errors, such as relying on unreliable sources or outdated information. Simply adding more training data and computing power may not fix these problems.

We may have to accept the reality of error-prone AI. Narayanan said in a recent social media post that, in some cases, it may be best to use such models only for tasks where fact-checking the answer is still faster than doing the research yourself. The best approach might be to avoid relying on AI chatbots for factual information altogether.

Source: www.newscientist.com
