The FDA’s AI Chatbot, Elsa, Is Failing

The Food and Drug Administration’s internal chatbot, Elsa, was supposed to help fix one of the agency’s most persistent operational burdens: sluggish, resource-intensive drug and device approvals. Instead, the tool is drawing internal skepticism, media scrutiny, and public confusion. Multiple FDA employees report that the chatbot frequently hallucinates references, fabricates clinical studies, and provides incorrect information on key topics such as drug labeling and review guidance.
The FDA has walked back initial claims about Elsa’s integration into formal workflows. The chatbot, officials now say, is confined to organizational assistance, not scientific review. But this rhetorical softening contrasts sharply with the broader federal messaging on artificial intelligence. The White House’s recently released AI Action Plan, part of a national directive from President Donald Trump, explicitly identifies healthcare as a lagging sector in AI adoption and calls for a “try-first” culture supported by regulatory sandboxes and Centers of Excellence. The contradiction between mandate and readiness is not merely rhetorical; it is administrative.
A Tool Without a Role
Elsa was introduced as part of the FDA’s broader innovation portfolio, and its rollout was accompanied by an article in the Journal of the American Medical Association co-authored by agency leaders Marty Makary and Vinay Prasad that promoted AI as a central feature of the FDA’s modernization strategy. Yet reporting from CNN and other outlets, drawing on internal sources, indicates that the tool is not usable for clinical evaluations because it lacks access to secure internal databases and shows a persistent tendency to deliver fabricated results.
The FDA’s Center for Biologics Evaluation and Research had positioned Elsa as a solution for information retrieval, document parsing, and process acceleration. But the tool appears to reflect a premature deployment rather than a validated implementation, and the resulting risks bear directly on the credibility of AI initiatives in high-stakes public health contexts.
The problems reported with Elsa mirror issues documented in large language models across multiple industries. A 2024 peer-reviewed study in Nature Medicine found that clinical AI systems prone to hallucination pose distinct safety risks when embedded into decision support environments. Similarly, a recent National Academy of Medicine panel on AI safety concluded that low-visibility deployment pathways, where tools are rolled out internally without peer-reviewed validation, present a unique challenge for maintaining standards of scientific integrity in regulatory settings.
Political Momentum Versus Operational Readiness
The tension between the Trump administration’s push for expedited AI adoption and the operational realities within agencies like the FDA is beginning to show. The Department of Health and Human Services has criticized press coverage of Elsa’s flaws, framing the tool’s rollout as a success story distorted by disgruntled employees and outdated information. But these critiques do not address the core concern: the tool is currently unfit for the function it was initially pitched to support.
The White House’s call to establish regulatory sandboxes for AI testing is not inherently problematic. Structured, low-risk test environments can be powerful accelerants of innovation. But they require governance frameworks that define what success looks like, what risks are acceptable, and how evidence is collected before broader adoption. At present, there is no publicly available set of standards or transparency protocol to determine whether a tool like Elsa should move from organizational support to clinical relevance.
A 2023 GAO report on federal AI oversight warned that agencies lacked adequate internal capacity to evaluate AI tools objectively. The report emphasized the importance of interagency collaboration and independent technical review to prevent premature deployment. These recommendations remain unaddressed in current implementation narratives.
Implications for Broader Healthcare AI Use
Elsa’s failings have implications beyond the FDA. Hospitals, payers, and digital health startups are closely monitoring federal agency adoption trends to inform their own governance models. When a flagship federal AI deployment struggles with basic accuracy and integration, it signals unresolved gaps in the broader readiness framework.
This also raises questions about procurement standards. If agencies are encouraged to adopt AI under accelerated timelines but lack robust evaluation metrics, it sets a precedent that may bleed into clinical operations. For example, tools used to support prior authorization decisions, revenue cycle management, or population health analytics may face reduced internal scrutiny if modeled after lightly validated federal examples.
The lack of reliable AI validation creates friction between innovation teams and compliance officers inside hospitals. According to a 2025 HIMSS survey, nearly 60% of health systems now use at least one AI-driven application in their clinical or administrative environments. However, fewer than half have implemented formal post-deployment auditing protocols. The result is a growing disconnect between enthusiasm for automation and institutional capacity to detect and mitigate tool failure.
Governance Must Move Faster Than the Mandate
The central lesson from Elsa’s deployment is not that AI cannot improve regulatory efficiency. Rather, it is that ambition without structure risks undermining trust in the technology itself. The FDA’s role in validating new therapies is too critical to accommodate speculative systems that cannot meet minimum scientific and technical thresholds.
The Trump administration’s emphasis on speed, experimentation, and federal coordination is unlikely to slow. But without parallel investments in technical validation, workforce upskilling, and transparent auditability, AI deployments in healthcare will struggle to gain sustainable traction.
Organizations such as the National Academy of Sciences and the Brookings Institution have called for clear federal benchmarks for AI accuracy, reproducibility, and interpretability. These recommendations offer a starting point for transforming political momentum into durable operational reform.
Until then, tools like Elsa are unlikely to fulfill their promise, and may even undermine it.