At RSNA, An Examination of the Pitfalls in AI Model Development

In a session entitled “Best Practices for Continuous AI Model Evaluation,” held on Tuesday, Nov. 27, during RSNA23, a panel of experts shared their perspectives on the challenges involved in building AI models in radiology. RSNA23, the annual conference of the Oak Brook, Ill.-based Radiological Society of North America, was held Nov. 25-30 at Chicago’s McCormick Place Convention Center. All three panelists (Matthew Preston Lundgren, M.D., M.P.H., Walter F. Wiggins, M.D., Ph.D., and Dania Daye, M.D., Ph.D.) are radiologists. Dr. Lundgren is CMIO at Nuance; Dr. Wiggins is a neuroradiologist and clinical director of the Duke Center for Artificial Intelligence in Radiology; Dr. Daye is an assistant professor of interventional radiology at Massachusetts General Hospital.

So, what are the key elements involved in clinical AI? Dr. Lundgren spoke first and presented most of the session. The key, he said, is to construct an environment in which data security protects patient information, while recognizing that complete de-identification is difficult. That environment must also work across modalities, leverage the best of data science, and incorporate strong data governance into every process.

With regard to the importance of data governance, Lundgren told the assembled audience that, “In general, when we think about governance, we need a body that will oversee the implementation, maintenance, and monitoring of clinical AI algorithms. Someone has to decide what to deploy and how to deploy it (and who deploys it). We really need to ensure a structure that enhances quality, manages resources, and ensures patient safety. And we need to create a stable, manageable system.”

What are the challenges involved, then, in establishing strong AI governance? Lundgren pointed to a four-step “roadmap” built around four questions: “Who decides which algorithms to implement? What needs to be considered when assessing an algorithm for implementation? How does one implement a model in clinical practice? And, how does one monitor and maintain a model after implementation?”

With regard to governance, the composition of the AI governing body is an essential element, Lundgren said. “We see seven groups: clinical leadership, data scientists/AI experts, compliance representatives, legal representatives, ethics experts, IT managers, and end-users,” he said. “All seven groups need to be represented.” As for the governance framework, there has to be a multi-faceted focus on AI auditing and quality assurance; AI research and innovation; training of staff; public, patient, and practitioner involvement; leadership and staff management; and validation and evaluation.

Lundgren went on to add that the governance pillars must incorporate “AI auditing and quality assurance; AI research and innovation; training of staff; public, patient, practitioner involvement; leadership and staff management; validation and evaluation.” And, per that, he added, “Safety really is at the center of these pillars. And having a team run your AI governance is very important.”

Lundgren identified five key responsibilities of any AI governing body:

 Defining the purposes, priorities, strategies, scope of governance

 Linking operation framework to organizational mission and strategy

 Developing mechanisms to decide which tools are to be deployed

 Deciding how to allocate institutional and/or department resources

 Deciding which are the most valuable applications to dedicate resources to

And then, Lundgren said, it is crucial to consider how to integrate governance with clinical workflow assessment, workflow design, and workflow training.

Importantly, he emphasized, “Once an algorithm has been approved, responsible resources must work with vendors or internal developers for robustness and integration testing, with staged shadow and pilot deployments respectively.”

What about post-implementation governance? Lundgren identified four key elements for success:

 Maintenance and monitoring of AI applications are just as vital to long-term success.

 Metrics should be established prior to clinical implementation and monitored continuously to avert performance drift.

 Robust organizational structures to ensure appropriate oversight of algorithm deployment, maintenance, and monitoring.

 Governance bodies should balance desire for innovation with the practical aspects of maintaining clinician engagement and smooth operations.
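
As a rough illustration of the second point above (metrics set before go-live and then watched continuously), the sketch below tracks accuracy over a rolling window of audited cases and raises a flag when it falls below a pre-agreed floor. This is not something shown in the session; the class name, the baseline, the tolerance, and the window size are all hypothetical values chosen for the example.

```python
from collections import deque


class RollingMetricMonitor:
    """Tracks accuracy over the most recent `window` audited cases."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline    # metric agreed on before go-live (assumed value)
        self.tolerance = tolerance  # allowed drop before alerting (assumed value)
        self.results = deque(maxlen=window)  # old cases fall out automatically

    def record(self, prediction, ground_truth):
        # Store whether the model agreed with the audited ground truth.
        self.results.append(prediction == ground_truth)

    def accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def drifted(self):
        acc = self.accuracy()
        # Only alert once the window is full, to avoid noisy early alarms.
        if acc is None or len(self.results) < self.results.maxlen:
            return False
        return acc < self.baseline - self.tolerance


# Toy usage: two of four audited cases are wrong.
monitor = RollingMetricMonitor(baseline=0.90, tolerance=0.05, window=4)
for pred, truth in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(pred, truth)
print(monitor.accuracy(), monitor.drifted())  # 0.5 True
```

In practice the choice of metric (sensitivity, specificity, and so on) and the alert threshold would be decisions for the governing body, and the audited ground truth is exactly the expensive ingredient Lundgren warns about later in the session.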

Importantly, Lundgren added that “We need to evaluate models, but also need to monitor them in practice.” And that means “shadow deployment,” which involves harmonizing acquisition protocols with what one’s vendor had expected to see (thick versus thin slices, for example). It’s important to run the model in the background and analyze ongoing performance, he emphasized, while at the same time moving protocol harmonization forward and potentially testing models before a subscription starts. For that to happen, one will have to negotiate with vendors.
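
To make the idea of running a model in the background concrete, here is a minimal sketch of a shadow run: the model scores each study, its output is logged next to the radiologist's own finding, and nothing is surfaced to the reader. The `ShadowLog` class, the `run_shadow` function, the study dictionary keys, and the toy model are all assumptions made for illustration; a real deployment would sit behind the PACS or a vendor platform.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowLog:
    """Holds (study_id, model_flag, radiologist_flag) tuples from a shadow run."""
    records: list = field(default_factory=list)

    def add(self, study_id, model_flag, radiologist_flag):
        self.records.append((study_id, model_flag, radiologist_flag))

    def agreement_rate(self):
        if not self.records:
            return None
        agree = sum(1 for _, m, r in self.records if m == r)
        return agree / len(self.records)

    def disagreements(self):
        # Study IDs worth a second look during the shadow period.
        return [sid for sid, m, r in self.records if m != r]


def run_shadow(model, studies, log):
    """Score each study silently; results are logged, never shown to readers."""
    for study in studies:
        model_flag = model(study["pixels"])
        log.add(study["id"], model_flag, study["radiologist_flag"])
    return log


# Toy usage with a stand-in "model" that flags any bright pixel.
def toy_model(pixels):
    return max(pixels) > 0.8


studies = [
    {"id": "A", "pixels": [0.1, 0.9], "radiologist_flag": True},
    {"id": "B", "pixels": [0.2, 0.3], "radiologist_flag": True},
]
log = run_shadow(toy_model, studies, ShadowLog())
print(log.agreement_rate(), log.disagreements())  # 0.5 ['B']
```

Reviewing the disagreement list by acquisition protocol (thick versus thin slices, for instance) is one way such a log could inform the protocol harmonization Lundgren describes.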

Very importantly, Lundgren told the audience, “You need to train your end-users to use each AI tool. And in that regard, you need clinical champions who can work with the tools ahead of time and then train their colleagues. And they need to learn the basics of quality control, and you need to help them define what an auditable result will be: what is bad enough a stumble to flag for further review?”

And Lundgren spoke of the “Day 2 Problem.” What does it mean when performance drops at some point after Day 0 of implementation? He noted that, “Fundamentally, almost any AI tool has basic properties: models learn joint distribution of features and labels, and predict Y from X—in other words, they work based on inference. The problem is that when you deploy your model after training and validation, you don’t know what’s going to happen over time in your practice, with the data. So everyone is assuming stationarity in production—that everything will stay the same. But we know that things do not stay the same: indefinite stationarity is NOT a valid assumption. And data distributions are known to shift over time.”
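
One common way to test the stationarity assumption without ground-truth labels is to compare the distribution of an input feature at validation time with what is arriving in production. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions) in plain Python. It was not presented in the session, it looks at a single feature rather than the joint distribution of features and labels, and the slice-thickness values and the alert threshold are made up for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    prod = sorted(production)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed sample values.
    for x in ref + prod:
        f_ref = bisect_right(ref, x) / len(ref)
        f_prod = bisect_right(prod, x) / len(prod)
        d = max(d, abs(f_ref - f_prod))
    return d


# Toy usage: slice thickness (mm) at validation time vs. in production.
validation_thickness = [1.0, 1.0, 1.25, 1.0]
production_thickness = [5.0, 5.0, 3.0, 5.0]

ALERT_THRESHOLD = 0.5  # hypothetical cut-off, to be set by the governing body
d = ks_statistic(validation_thickness, production_thickness)
print(d, d > ALERT_THRESHOLD)  # 1.0 True
```

A check like this can flag that inputs have shifted, but it cannot say whether model performance has actually dropped; that still requires labeled cases.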

Per that, he said, model monitoring will:

 Provide an instant model performance metric

 No prior setup required

 Can be directly attributed to model performance

 Helps reason about large amounts of performance data

 Data monitoring: constantly checking new data

 Can it serve as a departmental data QC tool?

In the end, though, he conceded, “Real-time ground truth is difficult, expensive, and subjective. Expensive to come up with a new test set every time you have an issue.”