Weill Cornell’s Thomas Campion Jr. on Making EHR Data ‘Research-Ready’
As its chief research informatics officer, Thomas Campion Jr., Ph.D., leads Weill Cornell Medicine's efforts to support researchers with the secondary use of electronic health record data. During a March 23 talk to the University of Michigan Medical School’s Learning Health System Collaboratory, he described his organization’s Architecture for Research Computing in Health (ARCH) program, which matches researchers with tools and services for obtaining electronic patient data.
Campion is an associate professor of research in population health sciences in the Division of Health Informatics at Weill Cornell Medical College. He has served as a co-investigator in multiple funded research initiatives, including the NIH’s RECOVER, N3C, ACT, and All of Us Research Program as well as PCORI’s INSIGHT Clinical Research Network. Nationally, he leads efforts to advance the secondary use of EHR data through the NIH CTSA consortium, Clinical Research Forum IT Roundtable, and Association of American Medical Colleges (AAMC) Group on Information Resources.
He began his talk by noting that it is now common for large academic medical centers to use Epic across the entire system for clinical care. But he noted that there is no Epic for clinical research. “Instead, there's a variety of different tools and services, and navigating those can be rather challenging, because it's important to understand the strengths and limitations of those systems, but also the underlying data from the electronic health record.”
Programs like his must deal with issues of structured vs. unstructured data, data quality, availability, and other factors. “Finally, but perhaps most importantly, is the need to obtain regulatory approval to conduct research,” Campion said. “Patient privacy is of paramount importance, and it's really difficult to understand all the different hurdles that one must clear in order to conduct a study in not only one institution, but also across multiple institutions. This can involve obtaining approval from an institutional review board or IRB, as well as getting contracts approved and other measures in place.”
Taken together, he said, this becomes a very complex socio-technical challenge. “When you take a look across the literature, we really see that optimal approaches are unknown,” Campion said. “We don't know the best ways to do this. And in the field of biomedical informatics, we're constantly testing hypotheses to see what works and what doesn't to support the biomedical research enterprise with electronic patient data.”
Campion described how the medical schools, hospitals and physician groups in New York City used to be on different EHR systems, complicating research efforts. Then, about five years ago, a decision was made to extend the Cornell Epic instance across several institutions in New York, so now there is one Epic implementation supporting Weill Cornell Medicine, Columbia Doctors and New York Presbyterian.
“This is a somewhat unusual organizational arrangement in which we have one hospital system with two competing physician organizations with two competing medical schools and two competing research enterprises all sharing one EHR system,” he said. “This is really outstanding for patient care, but it can add some complexity for analytics and research in particular.”
He described efforts within the Weill Cornell Medicine sphere of the New York Presbyterian system to support scientists with computational resources.
His group delivers a variety of services through their regular IT department, the same group that provides networking, e-mail, servers, security, and project management. Atop that foundation are three divisions that support a spectrum of activities, from the conduct to the administration of research. A scientific computing group provides all things high-performance computing. The research administrative computing team provides all the systems for compliance and planning. A research informatics division brings together efforts from partners in scientific computing and research administrative computing, but also from the clinical enterprise. “Through the efforts of these three divisions, plus other partners from across the tripartite landscape of Weill Cornell, Columbia, and New York Presbyterian, we deliver a suite of tools and services that we call Architecture for Research Computing in Health (ARCH),” Campion said.
“We have conceptualized support for a variety of different studies spanning from the study of populations through the study of individuals. Science often happens in a statistical software package like SAS or Stata or R or Python, or even Microsoft Excel,” he explained. “Our job in research informatics is to deliver data sets that are immediately amenable to statistical analysis in one of these types of packages. We're often thinking about things in terms of rows and columns of data, because that's what those statistical software packages need so that our faculty, staff and students can do what they're best at — like conducting epidemiological analyses, generating new models, and contributing to biomedical discovery. That's what we really seek to enable.”
Their job is to take raw data from the source systems, transform it into research-ready data sets, and make those data sets available to investigators. “This takes a lot of time. We've estimated that more than 50 percent of our time, across our team of 30 or so people, is just focused on some of these data engineering tasks,” Campion said. “This remains, I think, a major challenge across the country and the globe for how we best support investigators.”
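To make the "rows and columns" idea concrete, here is a minimal Python sketch of one such data engineering step: reshaping a long-format extract (one row per recorded fact) into one row per patient. It is an illustration only, not a description of ARCH's actual pipeline. The field names (`patient_id`, `concept`, `value`), the sample values, and the function `to_analysis_table` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format extract: one row per recorded fact, roughly as it
# might leave a source system. All field names and values are made up.
raw_rows = [
    {"patient_id": "P1", "concept": "age", "value": "64"},
    {"patient_id": "P1", "concept": "dx_code", "value": "E11.9"},
    {"patient_id": "P2", "concept": "age", "value": "51"},
    {"patient_id": "P1", "concept": "dx_code", "value": "I10"},
]


def to_analysis_table(rows, columns):
    """Pivot long-format rows into one row per patient with fixed columns."""
    by_patient = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_patient[row["patient_id"]][row["concept"]].append(row["value"])

    table = []
    for patient_id in sorted(by_patient):
        record = {"patient_id": patient_id}
        for col in columns:
            values = by_patient[patient_id].get(col, [])
            # Repeated values are joined with ";"; missing ones become "".
            record[col] = ";".join(values)
        table.append(record)
    return table


table = to_analysis_table(raw_rows, ["age", "dx_code"])
for record in table:
    print(record)
```

A table in this shape can be written to CSV and opened directly in a statistical package. A real pipeline would also have to deal with data types, units, time windows, and de-identification, none of which this sketch attempts.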
For quite some time, they were dealing with two different EHR systems supported by two different legal entities from two different IT departments. That included working with Epic and Allscripts before the Epic consolidation. “But through collaboration of New York Presbyterian, Columbia and Cornell, we were able to come up with something called the Tripartite Request Assessment Committee (TRAC), which serves as a front door, a single place to go and request data to take some of that guesswork out of the way for investigators. There's just one place where they can request data. The fulfillment of those data requests happens behind the scenes.”
With funding from the Patient-Centered Outcomes Research Institute (PCORI), Weill Cornell is leading the INSIGHT Clinical Research Network. This brings together electronic health record data from all of the major academic medical centers in New York City in one database. Rainu Kaushal, the chair of population health sciences and senior associate dean for clinical research, is the principal investigator of the INSIGHT network. “This is just a monumental achievement to bring together all these clinical competitors,” Campion said. “The right thing for research is to share these data elements. And this has been going on since about 2014 and has enabled a huge amount of discovery and research over time.”
Natural language processing
EHR data can loosely be categorized into two types: structured (things like diagnosis codes and procedure codes) and unstructured (such as physician notes, pathology reports, and radiology reports). “Those notes are a gold mine of what's going on with patients, but there can be so much variation in the way that physicians document and that's why natural language processing (NLP) is important to get these elements out,” Campion said.
ARCH has worked on NLP efforts in coordination with clinical colleagues. One example focuses on underrepresented populations. The team saw that about 50 percent of patients did not have a structured value for race or ethnicity: often the field was unspecified, null, or recorded as declined. “We posited that we could fill in the gaps by using NLP of notes to help get values of race and ethnicity,” he said, “and we were able to improve identification of patients who may be Black or Hispanic by more than 20 percent. This was primarily motivated to improve clinical trial enrollment for a study team at the institution that was seeking to address a huge need, and that's that clinical trials often do not include patients from traditionally underrepresented populations. With this mechanism, we're able to potentially address some of that gap.”
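The talk did not describe how the NLP itself was built, so the following Python sketch is only a rough illustration of the general idea of filling a missing structured field from note text. It swaps in simple keyword rules for whatever method the team actually used. The patterns, the labels, the list of "missing" tokens, and the function names are all assumptions made for the example.

```python
import re

# Illustrative rules only (assumed, not from the talk). A real system would
# need a validated lexicon, negation handling, and clinical review.
DESCRIPTOR = r"(?:man|woman|male|female|patient)"
PATTERNS = {
    "Black": re.compile(
        rf"\b(?:black|african[- ]american)\s+{DESCRIPTOR}\b", re.IGNORECASE
    ),
    "Hispanic": re.compile(
        rf"\b(?:hispanic|latino|latina)\s+{DESCRIPTOR}\b", re.IGNORECASE
    ),
}

# Structured values treated as "no usable answer" (hypothetical list).
MISSING_TOKENS = {"", "declined", "unknown"}


def candidate_values(note_text):
    """Return the labels whose pattern appears in a single note."""
    return sorted(label for label, pat in PATTERNS.items() if pat.search(note_text))


def fill_from_notes(structured_value, notes):
    """Keep a usable structured value; otherwise try to derive one from notes."""
    normalized = (structured_value or "").strip().lower()
    if normalized not in MISSING_TOKENS:
        return structured_value, "structured"

    found = set()
    for note in notes:
        found.update(candidate_values(note))

    # Accept a note-derived value only when the notes agree on one label.
    if len(found) == 1:
        return found.pop(), "note-derived"
    return structured_value, "unresolved"


print(fill_from_notes(None, ["64-year-old Hispanic woman presenting with cough."]))
print(fill_from_notes("Declined", ["Reports black stool since Monday."]))
```

Returning the provenance alongside the value ("structured", "note-derived", or "unresolved") keeps inferred values distinguishable from recorded ones, which matters if a patient explicitly declined to answer.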
Campion said he and his colleagues have developed expertise with things like the HIPAA Privacy Rule. “But this is really hard for most investigators to be able to follow,” he added. “Should they have to be experts in this? Probably not. There is a concept from clinical medicine that I really like: performing at the top of your license. I think what we want to do in informatics is make sure that our clinician scientists and other scientist colleagues can perform at the top of their license as researchers. We can provide guidance, not just for technology and data matters, but potentially also for some of the regulatory matters.”
In summary, Campion said that research informatics is helping investigators get data out of EHR systems and integrate data from disparate sources, all with the goal of creating research-ready data sets. “A lot of this work involves the ‘janitor’ work of collecting and cleaning data, which is a hugely expensive operational task and also a big part of fundamental research in biomedical informatics today. Self-service tools are critical to all of this,” he stressed. “It is going to be very, very difficult to have enough people to be able to respond to all of the individual requests that come from the research enterprise. I think we probably need a combination of self-service tools and investigator engagement to help boost data literacy. Although the tools can potentially help with some of the capacity issues, we still want to make sure that people can interrogate data thoughtfully.”