Advanced artificial intelligence models have scored highly on professional medical examinations, but they still struggle with one of a doctor's most important tasks: talking to patients, gathering relevant medical information and reaching an accurate diagnosis.
“Large language models perform well on multiple-choice tests, but their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “The models particularly struggle with open-ended diagnostic reasoning.”
This became clear when researchers developed a method for assessing the clinical reasoning ability of AI models through simulated doctor-patient conversations. The simulated “patients” were based on 2000 medical cases drawn primarily from US medical board specialty examinations.
“Simulating patient interactions allows assessment of history-taking skills, an important element of clinical practice that cannot be assessed from case descriptions,” says Shreya Johri, also at Harvard University. The new assessment benchmark, called CRAFT-MD, “reflects real-world scenarios where patients may not know which details are important to share and may only disclose important information when prompted by specific questions,” she says.
The CRAFT-MD benchmark itself relies on AI. OpenAI's GPT-4 model acted as a “patient AI” that conversed with the “clinical AI” being tested. GPT-4 also helped score the results by comparing the clinical AI's diagnosis with the correct answer for each case. Human medical experts double-checked these assessments, and also reviewed the conversations to verify the accuracy of the patient AI and to confirm whether the clinical AI had gathered the relevant medical information.
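As a rough illustration of how such an agent-versus-agent evaluation can be wired together, the sketch below pairs a “patient” model with a “clinical” model and uses a third call as a grader. This is a minimal sketch assuming the OpenAI Python SDK; the prompts, turn limit, model name and helper functions (`chat`, `run_case`, `grade`) are placeholders of our own and are not taken from the CRAFT-MD paper or its scoring rubric.

```python
# Illustrative sketch of a CRAFT-MD-style evaluation loop (not the authors' code).
from openai import OpenAI

client = OpenAI()

def chat(system_prompt, messages, model="gpt-4"):
    """Send a conversation to the model and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}] + messages,
    )
    return response.choices[0].message.content

def run_case(case_vignette, max_turns=10):
    """Simulate a diagnostic conversation between a patient AI and a clinical AI."""
    patient_system = (
        "You are a patient. Answer the doctor's questions using only the facts in "
        f"this case description, without volunteering extra details:\n{case_vignette}"
    )
    doctor_system = (
        "You are a doctor. Ask one question at a time to take a history, "
        "then state your final diagnosis prefixed with 'DIAGNOSIS:'."
    )
    doctor_view, patient_view = [], []

    for _ in range(max_turns):
        # Clinical AI asks a question, or commits to a diagnosis.
        doctor_msg = chat(doctor_system, doctor_view)
        doctor_view.append({"role": "assistant", "content": doctor_msg})
        patient_view.append({"role": "user", "content": doctor_msg})
        if "DIAGNOSIS:" in doctor_msg:
            return doctor_msg.split("DIAGNOSIS:")[-1].strip()

        # Patient AI answers based only on the case vignette.
        patient_msg = chat(patient_system, patient_view)
        patient_view.append({"role": "assistant", "content": patient_msg})
        doctor_view.append({"role": "user", "content": patient_msg})

    return None  # No diagnosis reached within the turn limit.

def grade(predicted, reference):
    """Ask a grader model whether the predicted diagnosis matches the answer key."""
    verdict = chat(
        "Answer 'yes' or 'no': do these two diagnoses refer to the same condition?",
        [{"role": "user", "content": f"Predicted: {predicted}\nReference: {reference}"}],
    )
    return verdict.strip().lower().startswith("yes")
```

In the study itself these automated judgements were additionally checked by human medical experts, which is the step a sketch like this cannot replace.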
Across multiple experiments, four leading large language models (OpenAI's GPT-3.5 and GPT-4, Meta's Llama-2-7b and Mistral AI's Mistral-v2-7b) performed significantly worse on the conversation-based benchmark than when making diagnoses from written case summaries. OpenAI, Meta and Mistral AI did not respond to requests for comment.
For example, GPT-4's diagnostic accuracy was an impressive 82 per cent when it was given a structured case summary and could select the diagnosis from a list of multiple-choice answers. Its accuracy fell when the multiple-choice options were removed, and when it had to reach a diagnosis through a simulated patient conversation it dropped to just 26 per cent.
GPT-4 performed best of the AI models tested in the study, with GPT-3.5 often coming second, the Mistral AI model sometimes coming second or third, and Meta's Llama model generally scoring lowest.
The AI models also frequently failed to collect complete medical histories: GPT-4, the best performer, did so in only 71 per cent of simulated patient conversations. And even when an AI model did gather the relevant medical history, it did not always produce the correct diagnosis.
Such simulated patient conversations are a “much more useful” way to assess an AI's clinical reasoning ability than medical exams, says Eric Topol at the Scripps Research Translational Institute in California.
Even if an AI model eventually passes this benchmark and consistently makes accurate diagnoses from conversations with simulated patients, it would not necessarily be better than a human doctor, says Rajpurkar. He points out that real-world medical practice is “more troublesome” than simulations: it involves managing multiple patients, coordinating with care teams, performing physical exams and understanding the “complex social and systemic factors” of the local healthcare setting.
“While strong performance on the benchmark suggests that AI may be a powerful tool to support clinical practice, it does not necessarily replace the holistic judgement of experienced physicians,” says Rajpurkar.
Source: www.newscientist.com