AI Chatbots in Health Care Need Safety Engineering, Not Just Smarter Models

August 18, 2025

Viewing: 17

A new study from the Icahn School of Medicine at Mount Sinai underscores a quietly growing risk in digital health: the tendency of AI chatbots to confidently spread false medical information. As large language models (LLMs) gain traction in clinical environments, supporting documentation, patient education, and even decision-making, the question of safety design has become urgent.

Mount Sinai’s researchers found that even a single fabricated medical term, such as a nonexistent disease or test, prompted popular AI models to generate elaborate, plausible-sounding explanations. Left unchecked, the tools routinely “hallucinated” medical misinformation. Yet the same study offered a silver lining: a brief, well-placed warning in the prompt cut these errors almost in half.

As the sector races toward wider deployment of AI tools in patient-facing and clinical support roles, these findings should serve as a design wake-up call. Intelligence alone does not ensure reliability. Safety must be engineered in from the start.

The Hallucination Problem in Clinical Context

Large language models are trained to predict words, not verify facts. This makes them powerful conversational tools—but also uniquely susceptible to confidently delivering inaccurate content. In health care, where misinformation can carry high risk, that vulnerability becomes a serious liability.

The Mount Sinai team tested several leading chatbots using fictional patient scenarios embedded with one false clinical term. Without any guardrails, the bots not only accepted the fake information but expanded on it, describing symptoms, causes, and treatments for conditions that do not exist.

This aligns with prior warnings from digital ethics experts and regulatory bodies. In 2023, the World Health Organization issued a caution against the unregulated use of generative AI in health, citing precisely this type of uncontrolled hallucination risk. And in 2024, the FDA’s Digital Health Advisory Committee began formal discussions around safety testing for LLMs used in health care software.

The Mount Sinai study adds valuable real-world data to this policy conversation. But more importantly, it points to an actionable path forward.

Prompt Engineering as a Safety Lever

In the study, researchers added a simple safety instruction to the user prompt—reminding the chatbot that the input may include inaccurate details. This small modification significantly reduced false elaboration by the model.

This finding is deceptively powerful. While most AI safety efforts focus on model architecture or post-output validation, prompt design is often overlooked. Yet in practice, prompt formulation is one of the easiest and most scalable levers developers can use to reduce risk.

For example, the National Institute of Standards and Technology (NIST) recently included prompt guidelines in its framework for AI risk management. And a growing number of clinical AI developers are investing in “instructional framing”—the use of embedded prompts to constrain or clarify chatbot behavior.

Mount Sinai’s findings show this is not just a theoretical benefit. A well-designed prompt can meaningfully limit harm.

Stress Testing as a Development Requirement

Perhaps more importantly, the Mount Sinai team introduced a simple stress-testing method that any health system or developer can replicate: embedding a false term and observing the model’s response. This “fake-term” test reveals how a system handles bad input, a critical yet often neglected dimension of validation.

It’s an approach that mirrors traditional software assurance principles: test how a tool behaves under failure conditions, not just in ideal cases. But in the fast-moving AI landscape, too many implementations rely on user-reported issues or retrospective audits rather than proactive testing.

Health systems evaluating chatbot tools, whether for patient triage, documentation support, or health education, should begin demanding stress-testing protocols as part of vendor diligence. Regulators, in turn, could require such tests for any system making clinical inferences.

As Girish Nadkarni, MD, MPH, Chair of Mount Sinai’s Windreich Department of AI and Human Health, stated, the challenge is not whether AI can be used in health care, but whether it can be engineered to “spot dubious input, respond with caution, and ensure human oversight remains central.” Testing those conditions must be part of the approval process.

Misplaced Trust in Confident Language

One reason these hallucinations are dangerous is their tone. Generative models don’t hedge. When asked about a fake condition, they often respond in crisp, confident medical language—creating a false sense of validity even for experienced clinicians.

This mirrors broader concerns about “automation bias,” a phenomenon in which users over-rely on technology even when it contradicts evidence or training. In clinical settings, this bias can lead to misdiagnosis, inappropriate treatments, or delayed escalation of care.

That’s why AI outputs, especially those involving patient education or decision support, should be labeled with provenance and confidence levels. Some LLMs already offer citation trails or uncertainty indicators, but these are inconsistently applied and not yet standardized.

Mount Sinai’s study underscores the importance of language tone in safety design. Developers should consider how formatting, disclaimers, and conversational style affect user trust—and calibrate their systems accordingly.

Engineering for Clinical Utility and Restraint

Ultimately, the goal is not to eliminate AI from health care but to create tools that are both intelligent and cautious. That means building for clinical utility—delivering timely, accurate insights—and for clinical restraint—knowing when to say “I don’t know.”

Mount Sinai’s team is now applying its testing methods to real, de-identified patient records and more advanced AI safety prompts. This next phase of research will help determine how scalable and generalizable their mitigation strategy is.

But even in its current form, the study offers a practical design framework:

Anticipate misinformation in the input, not just the output
Use prompt-based safety instructions as a first line of defense
Test with embedded falsehoods to reveal system weaknesses
Avoid overconfidence in language output by engineering for humility

AI models will only grow more sophisticated in the years ahead. But unless the health sector begins pairing that sophistication with robust, transparent safeguards, their benefits will remain shadowed by risk.

AI Chatbots, Hallucination, Mount Sinai, Prompt Engineering