A new study finds that a significant share of the advice chatbots dispense in health-related interactions, between 5 and 13 percent, may be risky or unsafe.
A recent research paper titled "Large language models provide unsafe answers to patient-posed medical questions" concludes that large language models (LLMs) currently give patients a significant proportion of unsafe medical advice, underscoring the risks of relying on publicly available chatbots for medical guidance.
The study, which focused on "advice-seeking" patient questions, found that between 5% and 13% of the medical advice given by such chatbots is dangerous or unsafe. The researchers curated a dataset of 222 advice-seeking medical questions from pediatrics, women's health, and other areas, and tested multiple large language models.
GPT-4o and Llama 3 were identified as the models with the highest rates of unsafe answers, each around 13%. Claude, by contrast, was the safest model, with an unsafe rate of 5%. Despite being the worst-performing model in the tests, Llama has been downloaded over a billion times and is the foundation model chosen by numerous health tech startups.
The study found that between 21% and 43% of responses from the tested chat models were rated as 'problematic'. Examples of dangerous advice included recommending tea tree oil to treat crust on the eyelids, giving water to infants under six months old, shaking a child's head, breastfeeding a child while infected with herpes, and treating the aftermath of a miscarriage as a counseling opportunity rather than a cue to seek medical attention.
The criteria for rating the responses included categories such as 'Unsafe', 'Includes problematic content', 'Missing important information', and 'Missing history taking'. Responses were gathered from Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Flash, Meta's Llama 3.1, and OpenAI's GPT-4o.
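To make the tallying concrete, the sketch below shows one way ratings like these could be aggregated into a per-model unsafe rate. It is a minimal illustration, not the study's actual evaluation code: the class and function names, and the toy example data, are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum

# Rating categories reported in the study (names paraphrased from the article).
class Rating(Enum):
    UNSAFE = "Unsafe"
    PROBLEMATIC_CONTENT = "Includes problematic content"
    MISSING_INFORMATION = "Missing important information"
    MISSING_HISTORY_TAKING = "Missing history taking"

@dataclass
class RatedResponse:
    model: str            # e.g. "Claude 3.5 Sonnet", "Llama 3.1"
    question: str         # an advice-seeking patient question
    answer: str           # the chatbot's reply
    ratings: set[Rating]  # labels assigned by reviewers

def unsafe_rate(responses: list[RatedResponse], model: str) -> float:
    """Fraction of a model's answers that reviewers flagged as unsafe."""
    answers = [r for r in responses if r.model == model]
    if not answers:
        return 0.0
    flagged = sum(Rating.UNSAFE in r.ratings for r in answers)
    return flagged / len(answers)

# Toy usage with made-up data (not the study's dataset):
if __name__ == "__main__":
    sample = [
        RatedResponse("Llama 3.1", "Is tea tree oil safe near my eyes?",
                      "Yes, apply it to the eyelid crust.", {Rating.UNSAFE}),
        RatedResponse("Claude 3.5 Sonnet", "Is tea tree oil safe near my eyes?",
                      "No; see a clinician about eyelid crusting.", set()),
    ]
    for m in ("Llama 3.1", "Claude 3.5 Sonnet"):
        print(m, f"unsafe rate: {unsafe_rate(sample, m):.0%}")
```

In the study itself, rates like the 5% and 13% figures reported above would come from applying this kind of count across the full set of 222 advice-seeking questions.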
The authors of the study emphasize that, without significant improvements in clinical safety, millions of patients could be exposed to unsafe medical advice from such AI tools, and they call for further work to improve the clinical safety of these systems.
In a healthcare system that is expensive and hard to navigate, users are turning to AI to cut costs and corners, despite the higher stakes involved. The authors concede that all the models studied have been updated since the collection period, but note that behavioral changes in LLMs will not necessarily improve any particular use case.
The domain of critical medical advice has very little tolerance for error, and using AI in this field carries greater risk than in other disciplines. The authors define two types of patient questions: advice-seeking questions, which directly invite a diagnosis or recommendation, and knowledge-seeking questions, which ask for general medical information.
While LLMs demonstrate some clinical knowledge, their current safety for dispensing medical advice without human oversight is limited and potentially hazardous. Substantial work remains to improve their reasoning, reliability, and integration into clinical workflows before they can be safely used directly by patients for medical advice.
- The study reveals that large language models pose a significant risk of giving unsafe medical advice, with models such as GPT-4o and Llama 3 producing unsafe answers at a rate of around 13%.
- In health and wellness, AI technology holds promise, but the domain of critical medical advice, especially concerning medical conditions, requires additional safety measures, as errors can be life-threatening.