In April, the National Center for Data to Health (CD2H) and the National Center for Advancing Translational Sciences (NCATS) announced they were leading the creation of a centralized, secure portal for hosting COVID-19 clinical data. After three months of rapid development, the National COVID Cohort Collaborative (N3C) is preparing to open its enclave to researchers within the next few weeks.
The N3C will accept data in multiple data models and transform them into a common OMOP analytic model during data harmonization. The cloud-based collaborative portal will enable development of machine learning and other informatics tools that require a large row-level dataset. The analytics ecosystem is provided by Palantir. Participating clinical institutions can work with a central IRB at Johns Hopkins University, which also handles central IRB work for the All of Us precision medicine program. Contributors and researchers will sign data use agreements with NIH to support data ingestion into the cloud environment, and qualified researchers, clinicians and data contributors can request access via a data access committee.
At a July 10 meeting of the NIH Collaboratory Grand Rounds, Ken Gersing, M.D., director of Informatics for NCATS, described the progress made so far. He noted that in federated research data models, a question is created and sent to the sites, and answers are sent back. By contrast, N3C is a centralized model in which the data is sent to an enclave, and researchers bring their own tools to it. He said the advantage is that a centralized core allows questions to be more open-ended and supports more machine learning algorithm work. There is also a pilot project, run with a company called MDClone, that involves the use of synthetic data.
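The difference Gersing describes between the two models can be illustrated with a minimal Python sketch. This is not N3C code: the function names, the row layout, and the `covid_positive` field are hypothetical, chosen only to show that a federated query returns one answer per site while a centralized enclave exposes pooled row-level data to whatever analysis a researcher brings.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # hypothetical row-level record


def federated_query(
    sites: Dict[str, List[Row]],
    question: Callable[[List[Row]], int],
) -> Dict[str, int]:
    """Federated model: the question travels to each site and only
    the answer (here, a single number per site) travels back."""
    return {name: question(rows) for name, rows in sites.items()}


def centralized_enclave(sites: Dict[str, List[Row]]) -> List[Row]:
    """Centralized model: row-level data is sent to one place, tagged
    with its source site, so any analysis can run over the pool."""
    pooled: List[Row] = []
    for name, rows in sites.items():
        for row in rows:
            pooled.append({**row, "site": name})
    return pooled


def count_positive(rows: List[Row]) -> int:
    """Example question: how many rows are flagged COVID-positive?"""
    return sum(1 for row in rows if row.get("covid_positive"))
```

In the federated sketch, every new question has to be written in advance and sent around again; in the centralized sketch, the pooled rows can feed open-ended analyses, which is the advantage the article attributes to N3C's design.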
N3C has been rapidly organized into several work streams:
• Data partnerships and governance
• Phenotypes and data acquisition
• Data ingestion and harmonization
• Collaborative analytics
Gersing said that so far 49 organizations have executed data transfer agreements to submit data to N3C, and 27 have submitted requests to the IRB. The platform has ingested 10 sets of data. Eighty-five percent of the academic medical centers in the NIH’s Clinical and Translational Science Awards (CTSA) program are participating so far. In addition, he said, 800 people have volunteered to help with the project. The data use agreement should be available next week, he said.
The phenotypes and data acquisition group has defined a COVID phenotype. “If COVID was tested, or even suspected, we are grabbing the entire medical record going back two years,” he said. “Each site has scripts that pull data and it comes into the harmonization core. We have asked sites to provide data every 48 to 72 hours.”
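As a rough illustration of the extraction rule Gersing describes, the Python sketch below selects every record from the past two years for any patient who was tested for, or suspected of having, COVID-19. It is a sketch under stated assumptions, not the actual site scripts: the `Record` type, the `kind` labels, and the choice to measure the two-year window back from the extraction date (approximated as 730 days) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Assumption: "two years" is approximated as 730 days before extraction.
LOOKBACK_DAYS = 2 * 365

# Hypothetical labels marking a COVID test or a suspected case.
COVID_KINDS = {"covid_test", "covid_suspected"}


@dataclass(frozen=True)
class Record:
    patient_id: str
    record_date: date
    kind: str  # e.g. "covid_test", "covid_suspected", "visit", "lab"


def select_cohort_records(records: List[Record], as_of: date) -> List[Record]:
    """Return all records inside the lookback window for patients who
    have at least one COVID test or suspected-COVID record."""
    cutoff = as_of - timedelta(days=LOOKBACK_DAYS)
    cohort_ids = {r.patient_id for r in records if r.kind in COVID_KINDS}
    return [
        r
        for r in records
        if r.patient_id in cohort_ids and cutoff <= r.record_date <= as_of
    ]
```

A site script along these lines would be rerun on the 48-to-72-hour schedule mentioned above, each time passing the current date as `as_of`.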
During an April 13 AMIA webinar, Oregon Health & Science University’s Melissa Haendel, Ph.D., CD2H program director, described the urgent need for the project: “We need better machine learning algorithms and algorithmic approaches to do things like perform rapid diagnosis, triage, and build predictive analytics,” she explained. “We also need best practices for resource allocation, how to best manage hospitals in this time of great need and we need to support informatics colleagues in delivering that information coming from clinics for discovery purposes. We believe all these things require the creation of a national comprehensive clinical data set to achieve these goals.”