Mount Sinai: AI Coding Accuracy Improves When Models Are Taught to Look First

A new study from the Icahn School of Medicine at Mount Sinai offers a deceptively simple insight: large language models assign medical diagnosis codes more accurately when prompted to consult similar past cases before selecting a code. This “retrieval-augmented” approach improved performance across nine AI systems, including small open-source models, and in many cases surpassed physician accuracy in assigning ICD codes to emergency department discharge summaries.
The implications are not limited to academic validation. As value-based care models expand and administrative efficiency becomes a strategic differentiator, retrieval-augmented AI has the potential to shift how clinical documentation is interpreted, validated, and operationalized across health systems.
The question is no longer whether generative models can interpret clinical narratives, but whether health systems are building them to do it responsibly, transparently, and with structured support.
Lookup Before You Code
The Mount Sinai team evaluated 500 emergency department encounters by feeding physician notes into nine different AI models. Each model first generated a diagnostic description, which was then matched to 10 real-world examples from a database of over one million hospital records. This “lookup” step, drawing on historical diagnosis language and frequency data, allowed the model to revise its initial description before selecting an ICD code.
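In code terms, the design is roughly a two-pass prompt with a retrieval step in between. The sketch below is illustrative rather than a reproduction of the study's pipeline; call_llm and find_similar_cases are hypothetical placeholders for whatever model client and historical case index a health system already runs.

```python
# Minimal sketch of the "lookup before you code" flow described above
# (illustrative only, not the study's actual pipeline). call_llm() and
# find_similar_cases() are hypothetical stand-ins for an LLM client and a
# nearest-neighbor search over a database of past encounters.

from dataclasses import dataclass


@dataclass
class PastCase:
    description: str  # free-text diagnosis language from a prior encounter
    icd_code: str     # the ICD code ultimately assigned to that encounter


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion model; swap in a real client."""
    raise NotImplementedError


def find_similar_cases(description: str, k: int = 10) -> list[PastCase]:
    """Placeholder for retrieval (e.g., vector search) over historical cases."""
    raise NotImplementedError


def assign_primary_code(ed_note: str) -> str:
    # Step 1: draft a diagnostic description from the physician note.
    draft = call_llm(f"Summarize the primary diagnosis in this ED note:\n{ed_note}")

    # Step 2: look up similar historical cases before committing to a code.
    examples = find_similar_cases(draft, k=10)
    context = "\n".join(f"- {c.description} -> {c.icd_code}" for c in examples)

    # Step 3: revise the description and pick a code, grounded in precedent.
    return call_llm(
        "Here are similar past cases and the codes they received:\n"
        f"{context}\n\n"
        "Revise the diagnosis below if needed and return the single best ICD code.\n"
        f"Diagnosis: {draft}"
    )
```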
Compared with models that skipped the lookup step, retrieval-augmented systems were more accurate and more consistent, and they often outperformed clinicians. Importantly, even small, open-source models improved when given access to contextual examples. The enhancement was not a function of scale, but of design.
As reported in NEJM AI, the study underscores a crucial shift in how AI can be engineered to serve, not replace, clinical coders and documentation teams. Retrieval augmentation adds a layer of interpretability and auditability to otherwise opaque model decisions, addressing one of the core barriers to adoption in clinical environments.
From Automation to Augmentation
Medical coding is a foundational administrative function, one that undergirds billing, population health analytics, reimbursement strategy, and compliance. Yet it remains one of the most resource-intensive and error-prone components of health IT operations. According to the Office of Inspector General, persistent documentation errors and code mismatches continue to cost Medicare billions annually.
Many EHR vendors have explored automation as a way to streamline coding workflows, but the results have been mixed. Automated systems often struggle with edge cases, rare diagnoses, or subtle clinical nuance. Worse, black-box models that cannot justify their code selections raise compliance red flags, particularly in risk-adjusted payment models like Medicare Advantage, where coding accuracy directly impacts reimbursement.
Retrieval-augmented AI offers a middle ground. By explicitly referencing similar historical cases, models provide both traceability and context. For revenue cycle leaders, this opens the door to hybrid workflows in which AI proposes codes, highlights discrepancies, or flags potential upcoding risks, without displacing human review.
As AHIMA has emphasized, the future of clinical documentation integrity (CDI) lies in assistive intelligence, not autonomous decision-making. Systems that support coders with contextually relevant, explainable suggestions will likely see faster adoption and stronger regulatory alignment than those promising full automation.
Technical Progress, Operational Gaps
Despite promising results, retrieval-based methods are not yet approved for billing use. The Mount Sinai study focused only on primary diagnosis codes for emergency department visits in which the patient was discharged home. It did not assess procedural codes, secondary diagnoses, or inpatient encounters, all of which introduce additional complexity.
Still, the model’s structure is generalizable. Health systems could deploy retrieval-augmented AI in non-billing contexts today, such as:
- Recommending codes during documentation review
- Detecting mismatches before claims submission (sketched in code after this list)
- Supporting chart audits and internal compliance reviews
- Generating training data for coders and CDI staff
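The mismatch check is the most mechanical of these and is easy to sketch: run a retrieval-augmented suggester over each pending claim and surface disagreements to a coder rather than auto-correcting them. The helper below is a hypothetical illustration; the claim field names and the suggester function are assumptions, not part of the study.

```python
# Illustrative pre-submission mismatch check (hypothetical helper; field names
# and the suggester are assumptions, not the study's code).

def flag_mismatches(claims: list[dict], suggest_code) -> list[dict]:
    """Return claims where the AI-suggested code disagrees with the assigned code.

    Each claim is expected to carry a 'note' (the discharge summary text) and an
    'assigned_code' (the coder's ICD selection). `suggest_code` can be any
    retrieval-augmented suggester, such as assign_primary_code() above.
    """
    flagged = []
    for claim in claims:
        suggestion = suggest_code(claim["note"])
        if suggestion != claim["assigned_code"]:
            flagged.append({**claim, "suggested_code": suggestion})
    return flagged
```

Keeping the output as a review queue rather than an automatic correction is what keeps this kind of tooling in the assistive lane described below.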
In parallel, vendors and academic collaborators are exploring how similar architectures could improve other high-friction workflows. A recent JAMA study on AI use in radiology reporting noted that retrieval augmentation also improved output consistency and reduced factual hallucinations.
But these methods require access to large, high-quality historical datasets, many of which remain fragmented across health systems. Interoperability and standardization remain essential if retrieval-enhanced systems are to function across multiple care sites or EHR platforms.
Beyond Performance: Ethics and Trust
The Mount Sinai research team is transparent about the limitations of its work. Retrieval-enhanced AI is not designed to replace clinicians or bypass human judgment. Rather, it is a mechanism to ground outputs in precedent, an essential step in building trust.
Transparency in AI coding is particularly important as scrutiny increases around algorithmic decision-making in healthcare. The GAO has warned that federal oversight remains limited, and that unvalidated models could introduce systemic bias, especially if trained on non-representative or outdated data.
By requiring the model to reference real-world examples, health systems can begin to move toward explainable, auditable AI—capable not just of proposing an answer, but of showing its work.
This shift is not just technical; it’s philosophical. It suggests that the future of AI in healthcare may hinge less on raw power, and more on design principles like humility, traceability, and structured support.
From Proof of Concept to Practice
Mount Sinai plans to integrate its retrieval-augmented method into the EHR for pilot testing, with the goal of expanding to broader diagnostic contexts. If successful, this approach could help alleviate the administrative load on clinicians, reduce coding errors, and improve the integrity of clinical data that underpins everything from reimbursement to analytics.
For health systems navigating AI implementation, this study offers a tangible roadmap: start with assistive use cases, ground model decisions in data, and prioritize design choices that promote clarity over control.
As pressure mounts to do more with less, with fewer staff, tighter budgets, and higher compliance thresholds, the need for transparent, reliable, and context-aware AI tools will only grow. Retrieval-augmented coding is one such tool. It doesn't replace humans; it respects them.
For this story, HIT Leaders and News used generative AI to help with an initial draft. An editor verified the accuracy of the information before publishing.