NYU Langone Developing Language Model as a Prediction Engine

Researchers from NYU Langone Health and tech company NVIDIA recently published a paper in the journal Nature about their use of a new large language model, NYUTron, that predicts a patient’s risk of 30-day readmission, as well as other clinical outcomes. Healthcare Innovation followed up with Eric Oermann, M.D., assistant professor of neurosurgery, radiology, and data science at NYU Grossman School of Medicine, and Mona Flores, M.D., global head of medical AI at NVIDIA, to discuss the potential impact of their work.

HCI: The general public has heard a lot about large language models in the last six or seven months. Is there something fundamentally different about large language models compared with previous predictive analytics in the healthcare space?

Flores: Yes. What's specific about these language models, whether it is NYUTron from NYU, or work being done at the University of Florida, is that they are specifically trained on a corpus of clinical language. They are trained on patients’ notes as opposed to other models which have a bunch of data that is from PubMed or something else. These are specialized models for clinical language. We were able to show that these models work for very specific tasks as compared to other models out there.

HCI: I understand you trained NYUTron to do five predictive tasks in the hospital setting: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay prediction, and insurance denial prediction. Could you talk about why you chose those five, and whether it did better on some than on others as far as its predictive capability?

Oermann: We chose them for two major reasons. First, we chose tasks that we felt were fairly well represented in the literature. Second, we wanted a variety of tasks that were relevant to clinicians but also relevant to administrators and individuals who run healthcare systems, covering more systemic concerns such as reimbursement.

HCI: I want to read a sentence from your paper in Nature and ask you to expand on it a bit. “By rethinking all of medical predictive analytics as a natural language processing problem, we show that it's possible to use LLMs as a universal prediction engine for a wide variety of tests.” Can you talk about the significance of that framing?

Oermann: I think it comes from our experience of building predictive models in our health system. Clinicians or executives in leadership say, ‘Hey, we want to build a model to do X.’ And that frequently sets off a multi-month process of trying to figure out how to do this with various features that we pull out of our EHR or that maybe we have to engineer into the EHR. We go through this classical machine learning pipeline.

In some ways the point of this project was to change the very nature of that kind of operational lift of building predictive models to solve things in medicine. We know that every patient encounter, regardless of where it is, generates text written by clinicians that describes all the things we're looking for when we normally try to get together all these features. So if we could start with that text as our universal source of medical information and build our models off of that, then rather than having to do all this work to try to find what we want to use to make predictions, we could just use the text to make those predictions and get the results that we want faster and more effectively, which is the major conclusion of our study.

Flores: It’s almost like we created a Swiss Army Knife for clinical tasks. So now as opposed to having one knife for every task and having to go to the trouble of training a model and doing all of that work, you have one model and you are able with very minimal effort to fine-tune it for all of these different tasks. It is definitely the Swiss Army Knife of language models for healthcare.

HCI: It sounds like training the models on a large amount of unstructured clinical data is the key to what this is all about.

Flores: When you say unstructured, David, what's important about that is: imagine if you actually had to go and label all of this data, and the amount of effort and time and money that it would take to do that. The ability to now train these language models without labeling this unstructured data allows you to bring in data from so many different modalities without the pain and the cost of labeling data.

HCI: I understand you tested a group of six physicians at different levels of seniority against NYUTron in a head-to-head comparison of predictive capability on 30-day readmission. Could you talk about what you found there?

Oermann: We took a set of patients and a group of physicians of various experience levels to try to actually test this, and all of our physicians did worse than the model, other than one very, very senior physician who seems to have an uncanny ability to identify who's coming back to the hospital.

HCI: To study the generalizability of this across environments, you tried it out at two hospitals within the NYU Langone system. Can you summarize what you found there?

Oermann: We took our two largest hospitals in the health system: one is Tisch Hospital in Manhattan and the other is NYU Brooklyn in the Cobble Hill neighborhood. There's a large divergence in terms of the physicians who are there, so we felt like this would be a reasonable first test of how well the models generalize. We built one from across the health system and tested on each site individually, but then built one in Tisch and deployed it in Brooklyn and built one in Brooklyn and deployed it in Manhattan. In both cases, we found there was a performance drop, unsurprisingly, due to the dataset shift, as we say, but that you could salvage it to a certain extent by fine-tuning on additional data at that specific site.

HCI: What are your next steps for this work? A clinical trial?

Oermann: Yes, a clinical trial and other use cases. We're developing over 100 other tasks within our health system.

HCI: Could you give one or two more examples of other use cases that you're looking at?

Oermann: One that springs to mind is trying to predict nutrition — how well-fed people are based on the notes is something that nursing leadership was really interested in. That's something that demonstrates the importance of this approach because if we had to build any other model to try to figure out how well-fed patients are, I wouldn't even know where to start with that.