AI Tools for Women’s Health: Incomplete Answers

Oscar Wong/Getty Images
Current AI models frequently struggle to provide accurate diagnoses or advice for pressing women’s health inquiries.
Thirteen AI language models from OpenAI, Google, Anthropic, Mistral AI and xAI were assessed on 345 medical questions spanning five fields, including emergency medicine, gynecology and neurology. The questions were curated by 17 women’s health experts, pharmacists and clinicians from the US and Europe.
Expert reviewers then analyzed the AI responses, checking failures against a medical-expertise benchmark of 96 queries.
On average, the models gave inadequate responses to 60% of the queries, according to the expert evaluations. GPT-5 was the strongest performer, with a 47% failure rate, while Mistral 8B had a 73% failure rate.
“I see more women using AI for health queries and decision support,” says Victoria-Elizabeth Gruber, a representative from Lumos AI, a firm focused on enhancing AI model assessments. She and her colleagues recognized the potential dangers of relying on technology that perpetuates existing gender imbalances in medical knowledge. “This inspired us to establish the first benchmark in this domain,” she explains.
Gruber expressed surprise over the high failure rates, stating, “We anticipated some disparities, but the variability among models was striking.”
This outcome is not unexpected, according to Kara Tannenbaum at the University of Montreal, Canada, as AI models are trained on historical data that may inherently contain biases. “It’s crucial for online health information sources and professional associations to enhance their web content with more detailed, evidence-based insights related to sex and gender to better inform AI,” she emphasizes.
Jonathan H. Chen at Stanford University notes that the claimed 60% failure rate may be misleading. “This figure is based on a limited expert-defined sample, which does not accurately represent regular inquiries from patients and doctors,” he says. “Some test scenarios are overly cautious and can lead to higher failure rates.” For instance, if a postpartum woman reports a headache, a model’s answer is counted as a failure if it does not immediately raise the possibility of pre-eclampsia.
Gruber acknowledges such critiques, saying: “Our intent was not to label the models as broadly unsafe but to establish clear, clinically relevant evaluation criteria. We purposefully set strict benchmarks because in medicine even minor omissions can be significant in some cases.”
An OpenAI representative stated: “ChatGPT aims to support, not replace, healthcare services. We closely collaborate with clinicians globally to refine our models and continuously evaluate them to minimize harmful or misleading output. Our latest GPT-5.2 models are designed to consider critical user contexts, including gender. We take the accuracy of our outputs seriously, and while ChatGPT can offer valuable insights, we advise consulting qualified healthcare providers for treatment and care decisions.” Other companies involved in the study did not respond to requests for comments from New Scientist.
Source: www.newscientist.com
