Study: ChatGPT Health underestimated the severity of medical emergencies 51.6% of the time and overestimated the severity of nonurgent cases 64.8% of the time

Researchers tested different medical scenarios with the chatbot. In more than half of cases where doctors would send a patient to the ER …

NBC News 2026-03-05 Kaan Ozcan

Context & Ripple Effects

This study extends a recent pattern of uneven health-related chatbot performance: an earlier test using Apple Health data found ChatGPT Health and Claude for Healthcare gave questionable, inconsistent responses.

It also lands as people increasingly share extensive personal records with chatbots, despite privacy risks and the generalized or inaccurate diagnoses they can receive in return. That raises the stakes from isolated answers to triage-like guidance.

First-order effects

ChatGPT Health users could receive urgency guidance that conflicts with clinician judgment in both directions: emergencies may be down-triaged while nonurgent cases are escalated.
The results weaken the case for treating the product's medical responses as a standalone triage tool, particularly where an ER referral is the relevant decision.

Second-order effects

Healthcare organizations and consumer-health partners evaluating chatbot workflows will have stronger reason to require clinician review, scenario-based testing, and clearer escalation boundaries before relying on them.
Competing health AI offerings face the same evidentiary pressure: prior testing already found inconsistent answers from ChatGPT Health and Claude for Healthcare, so broad claims of dependable guidance become harder to sustain.

Third-order effects

If these findings are repeated across products and testing methods, health chatbots are likely to be positioned more narrowly as information or administrative assistance rather than as substitutes for clinical triage.
The core competitive distinction may shift from general chatbot fluency to demonstrable reliability on high-consequence cases, with independent evaluation becoming more consequential for adoption.

The trend: Consumer AI is moving into health use cases faster than evidence of consistent performance on safety-critical judgment tasks is accumulating.

Discussion

@urocklive1 @urocklive1 on bluesky
Seems maybe Dr. Oz's plan to replace the rural hospitals that are closing with AI is not yet ready for Prime Time. — Open AI's health chatbot can pass medical exams, but is doing a bad job a correctly diagnosing health problems and assessing how critical they are. — www.nbcne…
@drianweissman @drianweissman on bluesky
ChatGPT Health ‘under-triaged’ half of medical emergencies in a new study. Researchers tested different medical scenarios with the chatbot. In more than half of cases in which doctors would send patients to the ER, the chatbot said it was OK to delay care. — www.nbcnews.com/h…
@cingraham Chris Ingraham on bluesky
First independent evaluation of ChatGPT Health: “Among gold-standard emergencies, the system under-triaged 52% of cases... Crisis intervention messages activated unpredictably across suicidal ideation presentations.” www.nature.com/articles/s41...
@erictopol Eric Topol on bluesky
🆕 at @naturemedicine.bsky.social — How does ChatGPT Health do for appropriately triaging a person as to whether to go to the emergency room or stay home? www.nature.com/articles/s41... Not very well. Under-triaged 52% of case vignettes that are considered gold-standard emerge…
@dannycrichton Danny Crichton on x
I find that ChatGPT for health is basically a neutral observer (which matches this data - everything it returns is the median crisis level). How critical you take the information it offers is really left up to the user, which isn't great

Chronicles