Solving the Paper Crisis: Transforming Text Documents through AI

Data is beginning to be liberated from static documents—and that represents a sea change
Oct. 23, 2025
9 min read

Key Highlights

A huge amount of critical information in healthcare remains locked up inside free-text documents.

Among the key elements in that massive trove are data related to the social determinants of health, vital information needed to improve patient care and care management.

Advances in natural language processing (NLP), optical character recognition (OCR), and large language models (LLMs) are allowing patient care organization leaders to automate the extraction of data previously trapped inside free-text documents within EHRs.

Healthcare has long struggled with a paradox. We live in an age of unprecedented digital sophistication—streaming platforms can anticipate what we want to watch before we do, and online retailers can predict what’s in our shopping cart weeks in advance. Yet in medicine, some of the most critical information about patients remains trapped inside static PDF files and scanned documents, locked away in formats that were never designed for clinical use. Nowhere is this more evident than in the realm of social determinants of health (SDOH), the non-medical factors that often dictate health outcomes more powerfully than any prescription.

The irony is striking. We know that where someone lives, their access to food and transportation, their employment status, and even their housing stability can profoundly influence their health trajectory. And yet, even when these details make their way into electronic health records (EHRs), they often exist as unstructured, unsearchable text—buried in referral notes, intake forms, or social work assessments saved as PDFs. For clinicians trying to build a holistic picture of a patient’s life, this means critical information is either hidden, inconsistently recorded, or worse, lost entirely.

This is not just an inconvenience. It’s a structural barrier to better care. If a patient’s chart contains information about their housing insecurity but a physician never sees it, that insight cannot inform care plans, resource referrals, or risk stratification models. The very data we need to drive better healthcare outcomes remains functionally invisible.

A data liberation moment

Fortunately, we are on the cusp of a major shift. Thanks to advances in natural language processing (NLP), optical character recognition (OCR), and large language models (LLMs), the idea of liberating data from static documents is no longer a futuristic vision—it is happening now. These tools can rapidly scan PDFs, physician notes, intake forms, and other unstructured records, converting them into structured, standardized, and usable data that integrates seamlessly into an EHR. What once required manual chart reviews, tedious data entry, or entire teams of abstractors can now be done in seconds.
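To make that concrete, here is a minimal sketch of what such a pipeline can look like: OCR pulls raw text out of a scanned PDF, and simple keyword rules flag SDOH-related language. It assumes the open-source pdf2image and pytesseract packages (with a local Tesseract install); the keyword lists and file name are illustrative placeholders, not a production NLP or LLM model.

```python
# Minimal sketch: OCR a scanned PDF, then flag SDOH-related language with
# simple keyword rules. The keyword lists below are illustrative placeholders.
from pdf2image import convert_from_path
import pytesseract

SDOH_KEYWORDS = {
    "transportation": ["no car", "bus pass", "no ride", "missed appointment due to transport"],
    "housing": ["homeless", "eviction", "shelter", "couch surfing"],
    "food": ["food pantry", "skipping meals", "food insecurity"],
}

def ocr_pdf(path: str) -> str:
    """Render each page of a scanned PDF to an image and run OCR on it."""
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def flag_sdoh(text: str) -> dict:
    """Return the SDOH domains whose keywords appear in the extracted text."""
    lowered = text.lower()
    return {
        domain: [kw for kw in keywords if kw in lowered]
        for domain, keywords in SDOH_KEYWORDS.items()
        if any(kw in lowered for kw in keywords)
    }

if __name__ == "__main__":
    text = ocr_pdf("referral_letter.pdf")   # hypothetical scanned referral
    print(flag_sdoh(text))                  # e.g. {"transportation": ["no ride"]}
```

In practice, the keyword step would be replaced by a trained NLP model or an LLM prompt, but the overall flow—document in, structured flags out—stays the same.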

Imagine this in practice: a scanned referral letter notes that a patient has limited access to transportation. With the right NLP pipeline, that fact can be extracted, coded, and flagged directly in the EHR as a transportation-related SDOH risk. Suddenly, a physician reviewing the patient’s chart doesn’t need to comb through attachments—they see actionable data immediately. More importantly, care teams can proactively respond, whether by arranging telehealth visits, coordinating rides, or connecting the patient with community resources.
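Once a barrier like this is detected, the next step is turning it into a coded, structured entry the EHR can act on. The sketch below shapes the finding loosely like an HL7 FHIR Observation using ICD-10-CM code Z59.82 (transportation insecurity); the patient identifier, the evidence sentence, and the exact SDOH profile a real integration would conform to are assumptions for illustration.

```python
# Minimal sketch: turn a detected transportation barrier into a structured,
# coded entry shaped loosely like a FHIR R4 Observation resource.
import json
from datetime import date

def transportation_risk_observation(patient_id: str, evidence: str) -> dict:
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "social-history",
            }]
        }],
        "code": {
            "coding": [{
                "system": "http://hl7.org/fhir/sid/icd-10-cm",
                "code": "Z59.82",           # ICD-10-CM: transportation insecurity
                "display": "Transportation insecurity",
            }],
            "text": "Transportation-related SDOH risk",
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": date.today().isoformat(),
        "note": [{"text": evidence}],       # sentence pulled from the referral letter
    }

obs = transportation_risk_observation("12345", "Patient reports no reliable ride to appointments.")
print(json.dumps(obs, indent=2))            # payload an integration could send to the EHR's FHIR API
```

The design point is that the output is standard and machine-readable: once the finding lives in a coded resource rather than a PDF attachment, it can drive alerts, referrals, and risk models.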

This is not about flashy AI gimmicks. It’s about making the data clinicians already have truly accessible and actionable.

From trapped data to clinical insight

The promise of this technology extends beyond convenience. By breaking down data silos, healthcare organizations can:

1. Build a more complete picture of the patient – Structured SDOH data, drawn from previously inaccessible sources, provides the context needed to treat the whole person, not just the disease.

2. Improve care coordination – When social workers, primary care physicians, specialists, and case managers all have access to the same enriched dataset, patients are less likely to fall through the cracks.

3. Reduce administrative burden – Automating data extraction reduces the hours clinicians spend on manual data entry.

4. Enhance population health analytics – Aggregating structured SDOH data enables health systems to identify community-level risks, target interventions, and allocate resources more effectively.

5. Drive equity in care – By shining a light on the social barriers that disproportionately affect vulnerable populations, this approach helps healthcare organizations move closer to equity-driven outcomes.

The shift is not hypothetical. Early adopters such as Watershed Health are already demonstrating how extracting structured data from unstructured documents leads to fewer missed diagnoses, more accurate risk stratification, and higher patient satisfaction.

Why this is the right kind of AI in healthcare

Of course, any mention of artificial intelligence in healthcare sparks legitimate concerns: Will machines replace clinicians? Will algorithms make life-or-death decisions? Will patient trust erode if technology takes too much of the wheel?

Here, the answer is reassuring. Using AI to unlock healthcare data is not about replacing judgment or clinical expertise—it’s about eliminating blind spots. It doesn’t change how physicians practice medicine; it ensures they practice with better, more complete information.

This is the right kind of AI application: narrow, reliable, and focused on reducing friction in the system rather than redefining it. It is not diagnosing patients, writing prescriptions, or making ethical decisions. It is simply ensuring that when a physician sits down to review a chart, they are not operating with partial information because key details are locked inside a PDF attachment.

In other words, AI here is an assistant, not a decider. It enhances access to actionable information without encroaching on the human elements of medicine that patients value most—empathy, trust, and judgment.

A call to action

The healthcare industry has a long history of letting technology overpromise and underdeliver. But in this case, the opportunity is too clear to ignore. We have the tools to unlock data that already exists in patient records and put it to work for better outcomes. The question is whether healthcare leaders will seize the moment.

EHR vendors must embrace interoperability and invest in integrating NLP and OCR pipelines directly into their platforms. Health systems should prioritize pilots that demonstrate how structured SDOH data improves care delivery and reduces costs. Policymakers and payers should incentivize the capture and use of this data, recognizing that upstream social factors drive downstream healthcare spending.

For too long, clinicians have been forced to practice with one eye covered, lacking the full picture of their patients’ lives. By freeing SDOH and other data from their document prisons, we can finally equip providers with the clarity they need.

That future is not science fiction. It is within reach today.

If healthcare is serious about treating patients as whole people and addressing the social determinants that drive health outcomes, then we must get serious about liberating data. Unstructured documents should no longer be a graveyard for critical information. With the responsible application of AI, they can instead become a goldmine—powering better care, driving equity, and improving lives.

The revolution begins not by inventing new data, but by finally using the data we already have.

George Bosnjak is co-founder of Morph Services, an innovative AI start-up company.

 
