
Medical Text Processing with the Healthcare Natural Language API


The FDA has a history of using real world evidence (RWE) as an integral component of the drug approval process. Moreover, RWE can mitigate the need for placebos in some clinical trials. The clinical records that make RWE useful, however, often reside in unstructured formats, such as doctor’s notes, and must be “abstracted” into a structured clinical format. Cloud technologies and AI can help accelerate this process, making it significantly faster and more scalable.

Leading drug researchers are starting to augment their clinical trials with real world data in their FDA study submissions because it saves time and reduces cost. Once a patient’s care concludes, the vast amount of historical unstructured medical data largely just adds to storage needs. Yet unstructured data is critical to clinical decision support systems: in its original form, a human must review it before any insight can be drawn. Without discrete data points from which insights can be quickly extracted, unstructured medical data can lead to care gaps and care variances. Unassisted human abstraction alone is neither fast nor accurate enough to work through all of this patient data. Applied natural language processing (NLP) using serverless software components on Google Cloud provides an efficient way to identify and guide clinical abstractors toward a prioritized list of patient medical documents.

How to run Medical Text Processing on Google Cloud

Using Google Cloud’s Vertex AI Workbench Jupyter notebooks, you can create a data pipeline that takes raw clinical text documents, processes them through Google Cloud’s Healthcare Natural Language API, and lands the structured JSON output in BigQuery. From there, you can build a dashboard that shows clinical text characteristics, such as the number of labels and relationships. You can then build a trainable language model that extracts text and can be further improved over time through human labeling.
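
To make that notebook step concrete, here is a minimal sketch of calling the Healthcare Natural Language API on a snippet of clinical text and streaming the flattened entity mentions into BigQuery. The project ID, location, table name, and flattened schema below are placeholders for illustration; the sketch assumes the Healthcare Natural Language API is enabled in your project and that application default credentials are available in the notebook environment.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import bigquery

# Placeholder identifiers -- replace with your own project, location, and table.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
TABLE_ID = f"{PROJECT_ID}.nlp_demo.entity_mentions"

# Authenticated session using application default credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

def analyze_clinical_text(text: str) -> dict:
    """Call the Healthcare Natural Language API analyzeEntities method."""
    url = (
        "https://healthcare.googleapis.com/v1/"
        f"projects/{PROJECT_ID}/locations/{LOCATION}/services/nlp:analyzeEntities"
    )
    response = session.post(url, json={"documentContent": text})
    response.raise_for_status()
    return response.json()

def load_mentions_to_bigquery(doc_id: str, nlp_response: dict) -> None:
    """Flatten entity mentions into rows and stream them into BigQuery."""
    rows = [
        {
            "doc_id": doc_id,
            "mention_text": mention.get("text", {}).get("content"),
            "mention_type": mention.get("type"),
            "confidence": mention.get("confidence"),
        }
        for mention in nlp_response.get("entityMentions", [])
    ]
    bq_client = bigquery.Client(project=PROJECT_ID)
    errors = bq_client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

# Example usage with a short de-identified snippet.
result = analyze_clinical_text("Patient reports chest pain; started metformin 500 mg daily.")
load_mentions_to_bigquery("doc-001", result)
```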

To better understand how the solution addresses these challenges, let’s review the medical text entity extraction workflow:

  1. Document AI for Data Ingestion. The system starts with a PDF file that contains de-identified medical text, such as a doctor’s hand-written notes or other unstructured text. This unstructured data is first processed by Document AI using optical character recognition (OCR) technology to digitize the text and images.
  2. Natural Language Processing. The Healthcare Natural Language API includes a set of pretrained models for extracting and classifying medical entities in text. The labels generated as part of the output of this service serve as the “ground truth” labels for the Vertex AI AutoML service, where additional, domain-specific custom labels will be added (a combined sketch of steps 1 and 2 appears after this list).
  3. Vertex AI AutoML. Vertex AI AutoML offers a machine learning toolset for human-in-the-loop dataset labeling and automatic label classification, using a Google model that your team can train with your data, even if team members possess little coding or data science expertise.
  4. BigQuery Tables. NLP processed records are stored in BigQuery for further processing and visualization.
  5. Looker Dashboard. The Looker Dashboard acts as the central “brain” for the clinical text abstraction process by serving visualizations that help the team identify the highest priority clinical documents using metrics like tag and concept “density.”
  6. Python Jupyter Notebook. Use either Colab (free) or Vertex AI (enterprise) notebooks to explore your text data and call different APIs for ingestion and NLP.
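
As referenced in step 2, the sketch below shows one plausible way to chain the Document AI OCR step into the entity extraction step. It assumes a Document AI OCR processor has already been created (the project, location, processor ID, and file name are placeholders) and reuses the hypothetical analyze_clinical_text helper from the earlier notebook sketch.

```python
from google.cloud import documentai

# Placeholder values -- replace with your project, location, and processor ID.
PROJECT_ID = "your-project-id"
DOCAI_LOCATION = "us"
PROCESSOR_ID = "your-ocr-processor-id"

def ocr_pdf(pdf_bytes: bytes) -> str:
    """Run a PDF through a Document AI OCR processor and return the extracted text."""
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(PROJECT_ID, DOCAI_LOCATION, PROCESSOR_ID)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    result = client.process_document(request=request)
    return result.document.text

# Digitize a de-identified clinical note, then extract medical entities from it.
with open("deidentified_note.pdf", "rb") as f:
    note_text = ocr_pdf(f.read())

entities = analyze_clinical_text(note_text)  # helper from the earlier notebook sketch
```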

The Healthcare Natural Language API

The Healthcare Natural Language API, together with the surrounding solution, lets you efficiently run medical text entity extraction at scale through the following optimizations:

  • Optimizing document OCR and data extraction by using scalable Cloud Functions to run the document processing in parallel (see the sketch after this list).
  • Optimizing cost and time to market by using completely serverless and managed services.
  • Facilitating a flexible and inclusive workflow that incorporates human-in-the-loop abstraction assisted by ML.
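
As mentioned in the first bullet, one way to parallelize document processing is a Cloud Storage-triggered Cloud Function, so that each uploaded PDF is handled by an independent invocation. The sketch below illustrates that pattern using a first-generation background function signature; the helper functions (ocr_pdf, analyze_clinical_text, load_mentions_to_bigquery) are the hypothetical ones from the earlier sketches.

```python
from google.cloud import storage

def process_uploaded_pdf(event, context):
    """Background Cloud Function triggered when a PDF lands in a bucket.

    Each invocation handles a single document, so uploads are processed in parallel.
    """
    bucket_name = event["bucket"]
    blob_name = event["name"]
    if not blob_name.lower().endswith(".pdf"):
        return  # ignore non-PDF uploads

    # Download the newly uploaded PDF from Cloud Storage.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    pdf_bytes = blob.download_as_bytes()

    # OCR, extract medical entities, and land the results in BigQuery
    # (helpers sketched earlier in this post).
    text = ocr_pdf(pdf_bytes)
    nlp_response = analyze_clinical_text(text)
    load_mentions_to_bigquery(blob_name, nlp_response)
```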

The following diagram shows the architecture of the solution.

