Researchers: AI-Generated Clinical Summaries Need Fine-Tuning

Feb. 6, 2024
LLMs summarizing clinical notes could introduce errors that affect clinicians' decisions

Clinical applications of generative artificial intelligence (AI) and large language models (LLMs) are advancing; LLM-generated summaries offer clear benefits and could replace many future interactions with electronic health records (EHRs). However, according to a team of researchers, LLMs that summarize clinical notes, medications, and other patient information operate without US Food and Drug Administration (FDA) oversight, which they see as a problem.

In a JAMA Network viewpoint article published online on Jan. 29, Katherine E. Goodman, JD, PhD, Paul H. Yi, MD, and Daniel J. Morgan, MD, MS, wrote, “Simpler clinical documentation tools…create LLM-generated summaries from audio-recorded patient encounters. More sophisticated decision-support LLMs are under development that can summarize patient information from across the electronic health record (EHR). For example, LLMs could summarize a patient’s recent visit notes and laboratory results to create an up-to-date clinical “snapshot” before an appointment.”

Without standards for LLM-generated summaries, there is a potential for patient harm, the article’s authors write. “Variations in summary length, organization, and tone could all nudge clinician interpretations and subsequent decisions either intentionally or unintentionally,” Goodman, Yi, and Morgan argued. Summaries vary because LLMs are probabilistic: there is no single correct answer as to which data to include or how to order it, and even slight variations between prompts can change the output. The authors give the example of a radiography report noting chills and a cough; the generated summary added the term “fever.” That single added word completes an illness script and could affect the clinician’s diagnosis and recommended course of treatment.
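The probabilistic-variation point can be made concrete with a minimal toy sketch (not from the JAMA article): a hand-built next-token distribution stands in for a summarizer's decoder, with token names and probabilities that are entirely hypothetical, chosen to echo the radiography example above. Sampled decoding over that distribution yields different outputs from the same prompt, including a plausible-but-unstated term like “fever.”

```python
# Toy sketch only: a hand-built next-token distribution standing in for an
# LLM's decoder. Token names and probabilities are hypothetical.
import random

# Hypothetical probabilities a summarizer might assign to candidate symptom
# tokens for a note that mentions only "chills" and "cough".
NEXT_TOKEN_PROBS = {
    "cough": 0.40,
    "chills": 0.35,
    "fever": 0.20,   # plausible, but absent from the source note
    "fatigue": 0.05,
}

def sample_tokens(k: int, seed: int) -> list[str]:
    """Draw k tokens, as stochastic (temperature > 0) decoding would."""
    rng = random.Random(seed)
    return rng.choices(
        population=list(NEXT_TOKEN_PROBS),
        weights=list(NEXT_TOKEN_PROBS.values()),
        k=k,
    )

# The "prompt" (the distribution) is identical, yet runs differ, and a run
# can surface "fever" -- a symptom the underlying note never recorded.
for seed in (1, 2):
    print(f"run {seed}:", sample_tokens(3, seed))
```

Even with deterministic decoding, the authors' underlying point stands: small changes in prompt wording shift these probabilities, so there is no single “correct” summary against which an output can be checked.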

The authors of the JAMA Network viewpoint write, “[F]DA final guidance for clinical decision support software…provides an unintentional “roadmap” for how LLMs could avoid FDA regulation. Even LLMs performing sophisticated summarization tasks would not clearly qualify as devices because they provide general language-based outputs rather than specific predictions or numeric estimates of disease. With careful implementation, we expect that many LLMs summarizing clinical data could meet device-exemption criteria.”

The article’s authors recommend regulatory clarifications by the FDA, comprehensive standards, and clinical testing of LLM-generated summaries.
