AI chatbots are incapable of diagnosing patients solely through conversation

Don’t call your favorite AI “Doctor” yet

Just_Super/Getty Images

Advanced artificial intelligence models score highly on professional medical examinations, but they still struggle with one of the most important tasks a doctor performs: talking to patients, gathering relevant medical information and delivering an accurate diagnosis.

“Large language models perform well on multiple-choice tests, but their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “Models especially struggle with open-ended diagnostic inference.”

This became clear when researchers developed a method to assess the reasoning ability of clinical AI models based on simulated doctor-patient conversations. The “patients” were based on 2000 medical cases drawn primarily from US medical board examinations.

“Simulating patient interactions allows assessment of history-taking skills, which is an important element of clinical practice that cannot be assessed through case descriptions,” says Shreya Johri, also at Harvard University. The new assessment benchmark, called CRAFT-MD, “reflects real-world scenarios where patients may not know what details are important to share and may only disclose important information if prompted by specific questions,” she says.

The CRAFT-MD benchmark itself relies on AI. OpenAI's GPT-4 model acted as a “patient AI” that conversed with the “clinical AI” being tested. GPT-4 also helped score the results by comparing the clinical AI's diagnosis with the correct answer for each case. Human medical experts double-checked these assessments. They also reviewed the conversations to check the accuracy of the patient AI and to see whether the clinical AI managed to gather the relevant medical information.
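The evaluation loop described above can be pictured with a small Python sketch. This is an illustration only, not the researchers' code: the `Case` record, the keyword-matching `patient_reply`, the `toy_diagnose` rule and the placeholder labels “condition A” and “condition B” are all assumptions standing in for the language models and the human review used in the real benchmark.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical case record: what the patient knows, keyed by topic."""
    facts: dict
    correct_diagnosis: str


def patient_reply(case, question):
    """Stand-in for the patient AI: reveals a fact only when asked about its topic."""
    for topic, fact in case.facts.items():
        if topic in question.lower():
            return fact
    return "I'm not sure."


def run_conversation(case, questions, diagnose):
    """Stand-in for the clinical AI: asks its questions, then diagnoses from the answers."""
    transcript = [(q, patient_reply(case, q)) for q in questions]
    return transcript, diagnose(transcript)


def grade(case, diagnosis):
    """Stand-in for the grader: compares the diagnosis with the correct answer."""
    return diagnosis.strip().lower() == case.correct_diagnosis.strip().lower()


def toy_diagnose(transcript):
    """Toy rule (an assumption): both findings must surface to reach condition A."""
    answers = " ".join(answer for _, answer in transcript).lower()
    if "fever" in answers and "rash" in answers:
        return "condition A"
    return "condition B"


case = Case(
    facts={
        "fever": "I have had a fever for three days.",
        "rash": "Yes, there is a rash on my chest.",
    },
    correct_diagnosis="condition A",
)

# A clinician that asks targeted questions gathers both facts.
_, thorough = run_conversation(case, ["Do you have a fever?", "Any rash?"], toy_diagnose)
# A clinician that asks only a vague question learns nothing useful.
_, sparse = run_conversation(case, ["How are you feeling?"], toy_diagnose)

print(grade(case, thorough), grade(case, sparse))
```

The same toy rule reaches the right answer only when its questions draw out the relevant facts, which is the history-taking skill the conversational format is meant to test.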

Multiple experiments showed that four leading large language models (OpenAI's GPT-3.5 and GPT-4 models, Meta's Llama-2-7b model and Mistral AI's Mistral-v2-7b model) performed significantly worse on the conversation-based benchmark than they did when making diagnoses based on written summaries of the cases. OpenAI, Meta and Mistral AI did not respond to requests for comment.

For example, GPT-4's diagnostic accuracy was an impressive 82 percent when it was presented with structured case summaries and could select the diagnosis from a list of multiple-choice answers, but lower when no multiple-choice options were provided. When it had to make a diagnosis from a simulated patient conversation, however, its accuracy dropped to just 26 percent.

And GPT-4 was the best-performing AI model tested in the study, with GPT-3.5 often coming second and the Mistral AI model sometimes coming second or third. Meta's Llama model generally scored lowest.

The AI models also failed to gather complete medical histories a significant proportion of the time, with the leading model, GPT-4, doing so in only 71 percent of simulated patient conversations. Even when an AI model did gather a patient's relevant medical history, it did not always produce the correct diagnosis.

Such simulated patient conversations are a “much more useful” way to assess an AI's clinical reasoning ability than medical exams, says Eric Topol at the Scripps Research Translational Institute in California.

Even if an AI model eventually passes this benchmark and consistently makes accurate diagnoses based on conversations with simulated patients, that would not necessarily make it better than a human doctor, says Rajpurkar. He points out that real-world medical practice is “more troublesome” than simulations. It involves managing multiple patients, coordinating with medical teams, performing physical exams and understanding the “complex social and systemic factors” in the local healthcare setting.

“While the strong performance in the benchmarks suggests that AI may be a powerful tool to support clinical practice, it does not necessarily replace the holistic judgment of experienced physicians,” says Rajpurkar.

Source: www.newscientist.com
