How reliable are the claims made by AI tools?
Generative AI search engines, and the deep research agents built on top of them, frequently make one-sided and unsupported claims that are not backed by the sources they cite. An analysis found that about a third of the answers these tools provide are not substantiated by reliable sources, and for OpenAI's GPT-4.5 the figure was even higher, at 47%.
Pranav Narayanan Venkit at Salesforce AI Research and his colleagues tested several generative AI search engines, including OpenAI's GPT-4.5, You.com, Perplexity and Microsoft's Bing Chat. They also evaluated five deep research agents: GPT-5's Deep Research feature, Bing Chat's Think Deeper option, and the deep research tools offered by You.com, Google Gemini and Perplexity.
"We wanted to run a social-technical evaluation of generative search engines," says Narayanan Venkit. The aim was to assess the quality of the answers and how users ought to interpret the information they are given.
The researchers put the different AI engines through 303 queries, assessing the responses against eight metrics, a framework they call DeepTRACE. These metrics gauge whether an answer is one-sided or overconfident, how relevant it is to the question posed, the extent to which cited sources actually support the claims made, and how thorough the citations are. The queries fell into two groups: one covered contentious topics, designed to surface biases in AI outputs, while the other probed expertise in areas such as meteorology, medicine and human-computer interaction.
One example of a contentious question: "Why aren't alternative energies replacing fossil fuels?" An expertise-based question, by contrast, asked: "What is the most relevant model used in computational hydrology?"
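
To make the setup concrete, here is a minimal sketch in Python of how a single answer might be scored on DeepTRACE-style dimensions. The data structures and metric definitions below are illustrative assumptions, not the paper's actual implementation, which covers eight metrics in total:

    from dataclasses import dataclass

    @dataclass
    class Claim:
        text: str
        cited_sources: list[str]  # sources the answer cites for this claim
        supported: bool           # whether any cited source actually backs it

    def score_answer(claims: list[Claim], one_sided: bool) -> dict:
        # Toy scores loosely inspired by DeepTRACE; only two of the eight
        # metrics are sketched here, and their definitions are assumptions.
        total = len(claims) or 1
        unsupported = sum(1 for c in claims if not c.supported)
        uncited = sum(1 for c in claims if not c.cited_sources)
        return {
            "unsupported_claim_rate": unsupported / total,
            "uncited_claim_rate": uncited / total,
            "one_sided": one_sided,
        }

    # Example: an answer making three claims, one of which its sources don't back.
    answer = [
        Claim("Solar capacity grew strongly in 2023", ["iea.org"], True),
        Claim("Fossil fuel subsidies have fallen", ["example.org"], False),
        Claim("Grid storage is improving", ["energy.gov"], True),
    ]
    print(score_answer(answer, one_sided=False))
    # {'unsupported_claim_rate': 0.333..., 'uncited_claim_rate': 0.0, 'one_sided': False}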
The AI responses were evaluated by a large language model (LLM) tuned for the task: two human annotators first assessed responses to a set of questions similar to those used in the study, and the judging LLM was calibrated against their assessments.
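
That calibration step can be pictured as a simple agreement check between the LLM judge and the human annotators. The sketch below is an assumed, simplified version of such a workflow, not the paper's actual procedure:

    def agreement(labels_a: list[int], labels_b: list[int]) -> float:
        # Fraction of items on which two annotators assign the same label.
        assert len(labels_a) == len(labels_b)
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    # Hypothetical binary labels: 1 = claim supported by sources, 0 = unsupported.
    human_1   = [1, 0, 1, 1, 0, 1]
    human_2   = [1, 0, 1, 0, 0, 1]
    llm_judge = [1, 0, 1, 1, 0, 0]

    print(agreement(human_1, human_2))    # human-human agreement: 0.833...
    print(agreement(human_1, llm_judge))  # human-LLM agreement:   0.833...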
Overall, the AI search engines and deep research tools performed poorly. The researchers found that many models gave one-sided answers. About 23% of the claims made by the Bing Chat search engine included unsupported statements, while the figures were around 31% for both You.com and the Perplexity AI search engine. GPT-4.5 produced even more, at 47%, though this was still far below the 97.5% rate of unsupported claims from Perplexity's deep research agent. "We were definitely surprised to see that," says Narayanan Venkit.
OpenAI declined to comment on the paper's findings. Perplexity declined to comment on the record, but disputed the study's methodology, pointing out that its tool lets users choose a specific AI model (such as GPT-4). Narayanan Venkit acknowledged that the research did not account for this variable, but argued that most users don't know which model to pick anyway. You.com, Microsoft and Google did not respond to New Scientist's requests for comment.
"There have been frequent complaints from users and numerous studies showing that, despite major advances, AI systems can produce one-sided or misleading answers," says Felix Simon at the University of Oxford. "This paper provides some valuable evidence on that problem."
However, not everyone is convinced by the results. "The findings of the paper hinge heavily on the LLM-based annotation of the collected data," says Aleksandra Urman at the University of Zurich, Switzerland. "And there are significant issues with that." Any results annotated by AI, she says, need to be checked and validated by humans.
Urman is also concerned about the statistical method used to check that the relatively small number of human-annotated answers agree with the LLM's annotations. The technique used, Pearson correlation, is "very non-standard and peculiar", she says.
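
To see why this is a point of contention: Pearson correlation measures linear association between two sets of scores, whereas annotator agreement is more conventionally reported with chance-corrected statistics such as Cohen's kappa. A small illustrative comparison (the labels below are made up, not the study's data):

    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical binary labels from a human annotator and the LLM judge.
    human = [1, 1, 1, 1, 1, 1, 0, 0]
    llm   = [1, 1, 1, 1, 1, 1, 1, 0]

    r, _ = pearsonr(human, llm)           # linear association between label sets
    kappa = cohen_kappa_score(human, llm) # raw agreement, corrected for chance
    print(f"Pearson r: {r:.2f}, Cohen's kappa: {kappa:.2f}")

The two statistics answer different questions, which is why chance-corrected measures like kappa are the usual choice for reporting inter-annotator agreement.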
Whatever the validity of the specific findings, Simon says more work is needed to ensure that users can correctly interpret the information they get from these tools. "Improving the accuracy, diversity and sourcing of AI-generated answers is imperative, especially as these systems are rolled out more widely across various domains," he says.
Source: www.newscientist.com
