Researchers Mine N3C EHR Repository to Identify Long-COVID Patients

March 31, 2022

Researchers developed machine learning models to identify 100,263 potential long-COVID patients using the National COVID Cohort Collaborative’s (N3C) EHR repository

David Raths

A research team has used the National COVID Cohort Collaborative’s (N3C) EHR repository to develop machine learning models to identify potential long-COVID patients. Their research is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the need to understand long-COVID and identify treatments.

The research team produced a peer-reviewed paper, “Who has long-COVID? A big data approach,” which is to be published in Lancet Digital Health and is posted on the medRxiv preprint server.

In a story on the Colorado-based UCHealth website, co-author Tell Bennett, M.D., head of the Informatics and Data Science section in the Department of Pediatrics at the University of Colorado School of Medicine, said the paper was the first produced by the RECOVER study, which is recruiting patients nationwide to study long COVID.

The researchers examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. They used these features and 597 long-COVID clinic patients to train three machine learning models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized.
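As a rough illustration of the three-cohort setup described above, the following Python sketch splits a set of patient records into the "all," "hospitalized," and "non-hospitalized" groups, each of which would get its own model. It is not the researchers' code: the `Patient` fields, the `build_cohorts` helper, and the four sample records are hypothetical, and the model training itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout; real N3C data has many more fields.
    patient_id: str
    hospitalized: bool        # hospitalized with COVID-19
    long_covid_clinic: bool   # training label: seen at a long-COVID clinic


def build_cohorts(patients):
    """Split records into the three cohorts, one per model."""
    return {
        "all": list(patients),
        "hospitalized": [p for p in patients if p.hospitalized],
        "non_hospitalized": [p for p in patients if not p.hospitalized],
    }


# Made-up records for illustration only; not real patient data.
patients = [
    Patient("a", True, True),
    Patient("b", False, False),
    Patient("c", False, True),
    Patient("d", True, False),
]

cohorts = build_cohorts(patients)
for name, members in cohorts.items():
    positives = sum(p.long_covid_clinic for p in members)
    print(f"{name}: {len(members)} patients, {positives} clinic-labeled")
```

Each cohort would then be passed to a separate classifier, with the clinic label serving as the training target.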

Their models identified potential long-COVID patients with high accuracy. Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the “all patients” model to the larger N3C cohort identified 100,263 potential long-COVID patients.

Patients flagged by the models can be interpreted as “patients likely to be referred to or seek care at a long-COVID specialty clinic,” an essential proxy for long-COVID diagnosis in the current absence of a definition. The models can also help identify potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs.

The UCHealth website story quotes Bennett: “You need to find a way to winnow down to the people who would most benefit from or be most willing to participate in a clinical trial.” That’s always an arduous task. But machine learning makes it easier to dig through layers of detailed electronic medical records in search of those patients, he added.

The story also quotes co-author Sarah Jolley, M.D., assistant professor of Pulmonary Sciences & Critical Care Medicine at the University of Colorado School of Medicine and medical director of the Post-COVID Clinic at UCHealth’s University of Colorado Hospital: “Using the EHR to inform those pathways will increase access to more standardized post-COVID care, particularly in rural and underserved areas where patients may not have access to a specialized long COVID clinic. For some providers who don’t see long COVID patients as frequently as we do, some of the symptoms are not as obvious or evident,” she said. A more precise definition of the condition, she added, will be “helpful to increase the awareness of its spectrum and will let providers know if patients are presenting with these symptoms, they should be believed.”