Duke Researchers Offer Frameworks for Evaluating Large Language Models
Following the National Academy of Medicine’s announcement last week of a Code of Conduct to guide responsible and equitable AI development, researchers at the Duke University School of Medicine have developed two new frameworks for evaluating the performance, safety, and reliability of large language models in healthcare.
Published in npj Digital Medicine and the Journal of the American Medical Informatics Association (JAMIA), the studies offer a new approach to ensuring that AI systems used in clinical settings meet the highest standards of quality and accountability.
As large language models become increasingly embedded in medical practice, from generating clinical notes to summarizing conversations and drafting patient communications, health systems are grappling with how to assess these technologies in ways that are both rigorous and scalable. The Duke-led studies, conducted under the direction of Chuan Hong, Ph.D., assistant professor in Duke’s Department of Biostatistics and Bioinformatics, aim to fill that gap.
The npj Digital Medicine study introduces SCRIBE, a structured evaluation framework for ambient digital scribing tools. According to Duke, SCRIBE draws on expert clinical reviews, automated scoring methods, and simulated edge-case testing to evaluate how well these tools perform across dimensions such as accuracy, fairness, coherence, and resilience.
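Duke has not released reference code alongside the paper, but the general idea of combining human and automated scores across fixed rubric dimensions is straightforward to illustrate. The minimal Python sketch below is purely hypothetical: the dimension names mirror those quoted above, while the data structures, scores, and unweighted averaging are invented for illustration and are not SCRIBE’s actual methodology.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rubric dimensions, mirroring those named in the article.
DIMENSIONS = ["accuracy", "fairness", "coherence", "resilience"]

@dataclass
class Review:
    """One reviewer's scores (1-5) per dimension; the reviewer may be
    a clinical expert or an automated scoring method."""
    scores: dict[str, float]

def aggregate(reviews: list[Review]) -> dict[str, float]:
    """Average each dimension across all human and automated reviews."""
    return {dim: mean(r.scores[dim] for r in reviews) for dim in DIMENSIONS}

# Invented example: one clinician review and one automated score
# for the same AI-drafted clinical note.
clinician = Review({"accuracy": 4.0, "fairness": 5.0, "coherence": 4.0, "resilience": 3.0})
automated = Review({"accuracy": 4.5, "fairness": 4.8, "coherence": 4.2, "resilience": 2.9})
print(aggregate([clinician, automated]))
```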
“Ambient AI holds real promise in reducing documentation workload for clinicians,” Hong said in a statement. “But thoughtful evaluation is essential. Without it, we risk implementing tools that might unintentionally introduce bias, omit critical information, or diminish the quality of care. SCRIBE is designed to help prevent that.”
A second, related study in JAMIA applies a complementary framework to assess large language models used within the Epic electronic health record platform to draft replies to patient messages. The research compares clinician feedback with automated metrics to evaluate aspects such as clarity, completeness, and safety. While the study found strong performance in tone and readability, it also revealed gaps in the completeness of responses, underscoring the importance of continuous evaluation in practice.
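The JAMIA paper itself is not accompanied by public code, and the sketch below is only a schematic illustration of the kind of check such a comparison implies: whether an automated metric tracks clinician judgment. All scores here are invented, and the use of a simple Pearson correlation is an assumption for illustration, not the study’s actual analysis.

```python
from statistics import correlation  # Pearson r; available in Python 3.10+

# Invented paired scores for six AI-drafted patient-message replies:
# clinician completeness ratings (1-5) and a hypothetical automated
# completeness metric (0-1) for the same drafts.
clinician_ratings = [4, 3, 5, 2, 4, 3]
automated_metric = [0.82, 0.61, 0.90, 0.40, 0.75, 0.58]

# Check how closely the automated metric tracks clinician judgment;
# weak agreement would flag the metric as an unreliable stand-in.
print(f"Pearson r = {correlation(clinician_ratings, automated_metric):.2f}")
```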
“This work helps close the distance between innovative algorithms and real-world clinical value,” Michael Pencina, Ph.D., chief data scientist at Duke Health and a co-author of both studies, said in a statement. “We are showing what it takes to implement AI responsibly, and how rigorous evaluation must be part of the technology’s life cycle, not an afterthought.”
The researchers said these frameworks form a foundation for responsible AI adoption in healthcare. They give clinical leaders, developers, and regulators the tools to assess AI models before deployment and monitor their performance over time — ensuring they support care delivery without compromising safety or trust.