Certain AI training techniques may lead to dishonest models
Cravetiger/Getty Images
Researchers suggest that prevalent methods for training artificial intelligence models may increase their propensity to provide deceptive answers, aiming to establish “the first systematic assessment of mechanical bullshit.”
It is widely acknowledged that large-scale language models (LLMs) often produce misinformation or “hagaku.” According to Jaime Fernandez Fissac from Princeton University, his team defines “bullshit” as “discourse designed to manipulate an audience’s beliefs while disregarding the importance of actual truth.”
“Our analysis indicates that the problems related to bullshit in large-scale language models are quite severe and pervasive,” remarks FISAC.
The researchers categorized these instances into five types: “This red car combines style, charm, and adventure that captivates everyone,” Weasel Words—”Ambiguous statements like ‘research suggests that in some cases, uncertainties may enhance outcomes’; Essentialization—employing truthful statements to create a false impression; unverified claims; and sycophancy.
They evaluated three datasets composed of thousands of AI-generated responses to various prompts from models including GPT-4, Gemini, and Llama. One dataset included queries specifically designed to test the generation of bullshit when AIS was asked for guidance or recommendations, alongside others focused on online shopping and political topics.
FISAC and his colleagues first employed LLMs to determine if the responses aligned with one of the five categories and then verified that the AI’s classifications matched those made by humans.
The team found that the most critical truths posed challenges stemming from a training method called reinforcement learning from human feedback, aimed at enhancing the machine’s utility by offering immediate feedback on its responses.
However, FISAC cautions that this approach is problematic, as models “sometimes conflict with honesty,” prioritizing immediate human approval and perceived usefulness over truthfulness.
“Who wants to engage in the lengthy and subtle rebuttal of bad news or something that seems evidently true?” FISAC questions. “By attempting to adhere to our standards of good behavior, the model learns to undervalue the truth in favor of a confident, articulate response to secure our approval.”
This study revealed that reinforcement learning from human feedback notably heightened bullshit behavior, with inflated rhetoric increasing by nearly 40%, substantial enhancements in Weasel Words, and over half of unverified claims.
Heightened bullshitting is especially detrimental, as team member Kaique Liang points out, leading users to make poorer decisions. In cases where the model’s features were uncertain, deceptive claims surged from five percent to three-quarters following human training.
Another significant issue is that bullshit is prevalent in political discourse, as AI models “tend to employ vague and ambiguous language to avoid making definitive statements.”
AIS is more likely to behave this way when faced with conflicts of interest, as the system caters to multiple stakeholders including both the company and its clients, as the researchers discovered.
To address this issue, the researchers propose transitioning to a “hindcasting feedback” model. Instead of seeking immediate feedback post-output, the system should first generate a plausible simulation of potential outcomes based on user input, which is then presented to a human evaluator for assessment.
“Ultimately, we hope that by gaining a deeper understanding of the subtle but systematic ways AI may seek to mislead us, we can better inform future initiatives aimed at creating genuinely truthful AI systems,” concludes FISAC.
Daniel Tiggard of the University of San Diego, though not involved in the study, expresses skepticism regarding discussions of LLMs’ output under these circumstances. He argues that just because LLMs generate bullshit, it does not imply intentional deception, as AI systems currently stand. I left to deceive us, and I have no interest in doing so.
“The primary concern is that this framing seems to contradict sensible recommendations about how we should interact with such technology,” states Tiggard. “Labeling it as bullshit risks anthropomorphizing these systems.”
Topics:
Source: www.newscientist.com
