Scientific Laboratories: A Potential Hazard
Researchers caution that deploying AI models in scientific laboratories poses real risks, potentially leading to dangerous experiments that end in fires or explosions. While these models can give a convincing semblance of understanding, they may lack essential safety knowledge. Recent testing of 19 advanced AI models found that all of them are capable of making critical errors.
Although severe accidents in academic laboratories are uncommon, they are not unheard of. Chemist Karen Wetterhahn died in 1997 after dimethylmercury penetrated her protective gloves. In 2016, a researcher was severely injured in an explosion, and in 2014 another scientist was partially blinded.
AI models are increasingly being used across industries, including at research institutions, to design experiments and procedures. Specialized AI tools have had notable success in scientific fields such as biology, meteorology and mathematics. General-purpose models, however, often produce inaccurate answers when a question falls outside their training data. That may be tolerable in casual applications like travel planning or cooking, but it poses life-threatening risks when devising chemical experiments.
To assess these risks, Zhang Xiangliang, a professor at the University of Notre Dame, developed LabSafety Bench, a benchmark that evaluates whether an AI model can recognize potential hazards and anticipate their consequences. It comprises 765 multiple-choice questions and 404 realistic scenarios that highlight safety concerns.
The team evaluated 19 state-of-the-art large language models (LLMs) and vision-language models. On the multiple-choice questions, some models, such as Vicuna, scored barely above random guessing, while GPT-4o achieved 86.55% accuracy and DeepSeek-R1 reached 84.49%. On image-based questions, models such as InstructBlip-7B managed less than 30% accuracy, and no model surpassed 70% on the scenario-based evaluations.
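For a rough sense of how such a benchmark is scored, here is a minimal Python sketch of a multiple-choice evaluation loop. The question format and the ask_model stub are illustrative assumptions, not LabSafety Bench's actual data layout or API.

```python
# Minimal sketch of scoring a model on multiple-choice lab-safety
# questions, in the spirit of LabSafety Bench. The data layout and the
# ask_model stub are assumptions for illustration, not the benchmark's API.
import random

# Hypothetical question format: a prompt, lettered options, correct letter.
QUESTIONS = [
    {
        "prompt": "Sulfuric acid splashes onto a gloved hand. What should you do first?",
        "options": {
            "A": "Rinse the affected area with plenty of water",
            "B": "Neutralize it with a concentrated base",
            "C": "Wipe it off and continue working",
            "D": "Wait for the acid to evaporate",
        },
        "answer": "A",
    },
    # ...the real benchmark contains 765 such questions...
]

def ask_model(prompt: str, options: dict) -> str:
    """Stand-in for a real LLM call. Here it guesses at random, which is
    roughly the baseline that the weakest models barely beat."""
    return random.choice(list(options))

def accuracy(questions: list) -> float:
    """Fraction of questions where the model picked the correct letter."""
    correct = sum(ask_model(q["prompt"], q["options"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    print(f"Accuracy: {accuracy(QUESTIONS):.2%}")
```

A random guesser on four options scores about 25%, which is approximately the floor that the weakest models in the study barely cleared.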
Although Zhang expresses optimism about the future of AI in scientific applications, particularly in “self-driving laboratories” where robots operate autonomously, he underscores that these models are not yet equipped to plan experiments effectively. “Currently? In the lab? I don’t think so. These models are primarily trained for general tasks, such as email drafting or paper summarization, excelling in those areas but lacking expertise in laboratory safety,” he states.
An OpenAI representative commented, “We welcome research aimed at making AI safe and reliable in scientific settings, particularly where safety is a concern.” They noted that the recent tests had not included any of their major models. “GPT-5.2 is the most advanced scientific model to date, offering enhanced reasoning, planning, and error detection capabilities to support researchers better while ensuring that human oversight remains paramount for safety-critical decisions.”
Requests for comments from Google, DeepSeek, Meta, Mistral, and Anthropic went unanswered.
Alan Tucker at Brunel University London asserts that while AI models may prove incredibly useful for aiding human experiment design, their deployment must be approached cautiously. "It's evident that new generations of LLMs are being used inappropriately because of misplaced trust," he says. "Evidence suggests that people may be relying too heavily on AI to perform critical tasks without adequate oversight."
Craig Malik, a professor at UCLA, described recently testing an AI model's response to a hypothetical sulfuric acid spill. The model repeatedly warned against the correct procedure of rinsing with water, instead offering irrelevant advice about potential heat buildup. He noted, however, that the model's responses had improved in recent months.
Malik stressed the necessity of instilling robust safety practices in new students, given their inexperience. Yet he remains more optimistic than some peers about the role AI could play in experimental design, stating, "Are they worse than humans? While it's valid to critique these large models, it's important to realize they haven't been tested against a representative human cohort. Some individuals are very cautious, while others are not. It's conceivable that these models could outperform a fraction of novice graduate students or even experienced researchers. Moreover, these models are continuously evolving, which means the findings of this paper may be outdated within months."
Source: www.newscientist.com