How can natural language processing extract insights from real-world evidence in healthcare
In this interview, we speak with Ellie D Norris (Merck & Co. Inc, NJ, USA) about how researchers are currently utilizing natural language processing to transform large volumes of unstructured data into structured data, which can subsequently be used to gain insights into real-world evidence in the healthcare sector. The interview also explores what sparked Ellie’s interest in AI and how she would like to see the field progress.
What sparked your interest in AI and how has the field progressed since then?
My interest in AI began in 2004 during my years in bioinformatics. At the time, a colleague was specializing in analyzing microarrays, which were laboratory tools used to detect the expression of thousands of genes concurrently. He had begun to apply machine learning algorithms and statistical methods towards several use cases concerning genetic variation and drug response. As I had a limited statistical background, and there were few educational options in machine learning, I enrolled in a postgraduate certificate program to become competent in biostatistics theory and practice. Soon after, I was assisting with several microarray analysis projects and eventually applying similar supervised learning methods independently to build classification models directly on the protein sequences of various drug receptors.
The field of AI has certainly exploded over the past decade due to the growth in datasets, the availability of scalable infrastructure, and, of course, the computational power provided by GPUs and TPUs. There are numerous educational programs for AI that were non-existent when I first pursued this path, including formal degree programs, boot camps, civic organizations, online training subscriptions, and massive open online courses (MOOCs). I am grateful to have returned to AI during the last 2 years and to be approaching it from a leadership role in which I am managing development squads and stewarding key competencies to design and implement intelligent applications.
How can natural language processing be applied to extracting insights from real-world evidence (RWE) in healthcare?
The US FDA defines real-world data (RWD) as “data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources” and real-world evidence (RWE) as “the clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD”. Sources of RWD include electronic health records (EHR), insurance claims and billing activities, product registries, and data gathered from mobile health devices. The resulting RWE generated from these RWD sources can help to inform the drug development and commercialization processes and monitor post-market safety [3].
Natural language processing, which grew out of the field of linguistics, is defined as the automatic manipulation of natural language by software. It is used to process and evaluate large amounts of natural language by transforming unstructured text into structured data. AI is a superset of natural language processing, which now overlaps with machine learning and deep learning to enhance the computer processing of human language. These advanced methods, fueled by large datasets and increasing computational power, make it possible to normalize text to a common data model and reduce its variability through processes such as stemming and lemmatization. Natural language processing can therefore analyze and extract key facts from the various unstructured RWD sources to generate structured RWE and derive actionable insights from it. What used to be a laborious or impractical process can now be markedly facilitated by these techniques.
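To make the normalization step concrete, here is a minimal sketch of stemming and lemmatization in Python. The suffix rules and the tiny lemma lookup are invented for illustration; they are not the Porter stemmer or a WordNet-backed lemmatizer that production NLP libraries provide.

```python
# Toy illustration of text normalization in an NLP pipeline.
# The suffix list and lemma lookup below are illustrative assumptions,
# not a real stemming algorithm or dictionary.

SUFFIXES = ("ing", "ized", "izes", "ed", "es", "s")

LEMMA_LOOKUP = {  # hand-picked irregular forms (assumed examples)
    "was": "be", "were": "be", "better": "good", "mice": "mouse",
}

def stem(word: str) -> str:
    """Strip the first matching suffix (crude suffix-stripping stemming)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    """Map irregular forms via lookup, otherwise fall back to stemming."""
    return LEMMA_LOOKUP.get(word, stem(word))

tokens = "patients were reporting improved outcomes".split()
print([lemmatize(t) for t in tokens])
# → ['patient', 'be', 'report', 'improv', 'outcom']
```

The crude output (e.g., "improv") shows why real stemmers carry many more rules: the goal is only that inflected variants of a word collapse to one common token before analysis.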
Have there been any particularly interesting examples of this recently?
Clinical documentation, such as physician notes and patient discharge summaries, is an immense and growing source of unstructured textual data, whether it originates as text or speech (e.g., a physician dictation). Natural language processing combined with machine learning and deep learning can extract products, treatment outcomes and adverse event attributes from physician notes at scale, generating content that is not stored in structured electronic medical record (EMR) datasets. These methods enable the effective analysis of previously hidden knowledge to improve reporting and analytics.
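As a toy sketch of what "extracting structured attributes from a note" means, the following matches a free-text note against small term lists. The product and adverse-event vocabularies and the sample note are invented for the example; real systems use curated ontologies (e.g., MedDRA, RxNorm) and trained clinical named-entity recognition models rather than keyword lookup.

```python
import re

# Tiny illustrative vocabularies -- assumptions for this sketch only.
PRODUCTS = {"metformin", "lisinopril"}
ADVERSE_EVENTS = {"nausea", "dizziness", "rash"}

def extract_entities(note: str) -> dict:
    """Scan a free-text clinical note and return a structured record."""
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    return {
        "products": sorted(tokens & PRODUCTS),
        "adverse_events": sorted(tokens & ADVERSE_EVENTS),
    }

note = "Patient started on metformin; reports mild nausea and dizziness."
print(extract_entities(note))
# → {'products': ['metformin'], 'adverse_events': ['dizziness', 'nausea']}
```

The structured output is what downstream RWE analytics can aggregate and query, which free text alone cannot support.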
A recent example originates from a scientific paper in which the researchers performed medical abstraction through natural language processing, utilizing advanced deep learning techniques to scale real-world evidence generation. As these sophisticated models require large datasets with annotated examples, and previous work in clinical natural language processing has generally been limited to small datasets, their approach was to use available data from medical registries built for specific therapy areas. The selected registry data was matched to the corresponding EMR data from patients to build a sufficiently sized dataset to train and evaluate the "deep natural language processing methods". The researchers believe they were the first to explore cross-document medical information extraction at the time of publication. The document types included radiology reports, pathology reports and progress notes, with dozens of clinical documents per patient and collections that spanned decades. Their experiments showed that their methods, applied across multiple document types, outperformed previously published simpler approaches that cannot handle complex semantics; their most performant model combined transformers, domain-specific training, recurrent neural networks, and hierarchical attention. They also found that neural attention facilitated model interpretability, as it highlighted the relevant portions of text behind the model's extraction rationale and thereby supported validation by subject matter experts.
Could you please explain what the SustaiNLP movement is and why it is important?
I learned of the aptly named SustaiNLP annual workshop about a year ago; it has aimed to promote "more sustainable natural language processing research and practices" since 2019. With the rise of increasingly complex and computationally intensive state-of-the-art models, which I have studied as part of a natural language processing working group hosted by Aggregate Intellect, this workshop focuses on efficiency and justifiability, counteracting the attention on ever-changing leaderboards and the "bigger is better" mentality through a call for community-wide paper submissions. Suggested topics have included: (1) models that yield competitive performance but require fewer computational resources; (2) new best practices in reporting experimental results; and (3) critical analysis of existing evaluation protocols. The SustaiNLP workshop seeks to complement similar events on reproducibility and interpretability.
What is the next step for utilizing natural language processing to investigate RWE in healthcare and how would you like to see the field progress?
To date, natural language processing, and its subset natural language understanding (NLU), have been the primary focus for researchers and solution providers seeking to identify and understand relevant entities from documentation, as they handle syntactic parsing and semantic parsing respectively. These are the required methods in RWE data generation and analysis, and significant advancements have been made in recent years.
However, the English language currently dominates the field of natural language processing, and many of these advancements have therefore been accomplished in English, along with a smaller number of other high-resource (i.e., based on the availability of Internet data) languages. We must appreciate that there are over 7000 unique languages across the globe and that English is spoken by only 17% of the world's population. As a result, there are many opportunities to close the gap and support international RWE generation for diverse patient populations.
One option is through the more complex natural language processing task of machine translation (MT), which automatically converts one human language to another while preserving its meaning. There has been growing momentum due to the emergence of neural methods that may support the existing high-resource languages. Particularly in the life sciences, the MT growth has been spurred by the increasing volume of (1) multilingual research that impacts drug development and (2) multilingual external consumer data shared through digital platforms. Due to the nature of the data, high-quality, accurate translations are of utmost importance and may require significant manual input initially, as the captured translation memory gradually expands and becomes increasingly reusable [10].
Interviewee profile:
My name is Ellie D. Norris, and I am honored to contribute to the launch of FMAI's digital site. Since receiving my academic degrees in biochemistry and bioinformatics during the late 1990s, I have acquired 24 years of experience in scientific R&D and information technology. My career began with over a decade as a bioinformatics scientist in drug discovery organizations. I then transitioned to information technology and held various positions pertaining to information architecture, product ownership, solution architecture and delivery orchestration to assist both research and development business functions.
For the majority of my career, I have worked for the pharmaceutical company Merck & Co., Inc. More specifically, I am part of the Application Engineering organization that supports the Clinical & Real-World Evidence Generation (CRWEG) Value Team. Within CRWEG, I serve as the Chapter Lead of our innovation capability named "Augmentation", which facilitates experiments to explore new technologies and problem-solving methods through the proof-of-concept phase, with a specialty in natural language processing use cases. I also serve as the Digital Technologies Chapter Lead for select natural language processing products, in which we use an expanded delivery team for the full system development life cycle through production deployment.
The opinions and views expressed in the article are the author's own and do not necessarily represent the views of Merck & Co., Inc. (NJ, USA), Future Medicine AI Hub or Future Science Group.