
A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study

BACKGROUND: Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches....


Bibliographic Details
Main authors: Mitchell, Joseph Ross; Szepietowski, Phillip; Howard, Rachel; Reisman, Phillip; Jones, Jennie D; Lewis, Patricia; Fridley, Brooke L; Rollison, Dana E
Format: Online Article Text
Language: English
Published: JMIR Publications 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8987958/
https://www.ncbi.nlm.nih.gov/pubmed/35319481
http://dx.doi.org/10.2196/27210
author Mitchell, Joseph Ross
Szepietowski, Phillip
Howard, Rachel
Reisman, Phillip
Jones, Jennie D
Lewis, Patricia
Fridley, Brooke L
Rollison, Dana E
author_sort Mitchell, Joseph Ross
collection PubMed
description BACKGROUND: Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.
OBJECTIVE: The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports.
METHODS: We pursued 3 specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to 2 predefined questions in each pathology report: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these 2 questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning 2 additional BERT models: one to predict site codes and another to predict histology codes. Our final system includes a network of 3 BERT-based models. We call this the CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars.
RESULTS: caBERTnet's accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively.
CONCLUSIONS: We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
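The RESULTS figures in the abstract are plain and top-5 accuracies, so they can be re-derived from the reported counts. The sketch below is not the authors' evaluation code. It assumes a top-k hit means the registrar-assigned code appears among the k highest-ranked predictions for a report, and the ranked code lists are hypothetical values chosen only for illustration. The helper names `top_k_accuracy` and `percent` are likewise assumptions.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int = 5) -> float:
    """Fraction of reports whose true code is among the k highest-ranked predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one report is required")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


def percent(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to 2 decimals, as printed in the abstract."""
    return round(100 * correct / total, 2)


# Hypothetical ranked ICD-O-3 site codes for three reports (illustrative only,
# not taken from the study data). Each inner list is ordered best guess first.
ranked = [
    ["C50.9", "C50.4", "C50.8"],
    ["C34.1", "C34.9", "C34.3"],
    ["C18.7", "C18.9", "C20.9"],
]
truth = ["C50.4", "C34.3", "C61.9"]

top1 = top_k_accuracy(ranked, truth, k=1)  # no first guess matches
top3 = top_k_accuracy(ranked, truth, k=3)  # two of three reports match

# Counts exactly as reported in the abstract.
reported = {
    "group-level site": percent(1895, 2026),
    "group-level histology": percent(1993, 2042),
    "top-5 fine-grained site": percent(1794, 1930),
    "top-5 fine-grained histology": percent(1853, 1930),
}

for name, value in reported.items():
    print(f"{name}: {value}%")
```

The denominators differ from the 2050 reports in the test set. The abstract states that the fine-grained figures are restricted to codes with 5 or more training samples, but it does not explain the group-level denominators.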
format Online
Article
Text
id pubmed-8987958
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-8987958 2022-04-08 A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study. Mitchell, Joseph Ross; Szepietowski, Phillip; Howard, Rachel; Reisman, Phillip; Jones, Jennie D; Lewis, Patricia; Fridley, Brooke L; Rollison, Dana E. J Med Internet Res. Original Paper. JMIR Publications 2022-03-23 /pmc/articles/PMC8987958/ /pubmed/35319481 http://dx.doi.org/10.2196/27210 Text en
©Joseph Ross Mitchell, Phillip Szepietowski, Rachel Howard, Phillip Reisman, Jennie D Jones, Patricia Lewis, Brooke L Fridley, Dana E Rollison. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 23.03.2022.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
title A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study
topic Original Paper