Improving Information Extraction from Pathology Reports using Named Entity Recognition
Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two...
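Main authors: | Zeng, Ken G.; Dutt, Tarun; Witowski, Jan; Kranthi Kiran, GV; Yeung, Frank; Kim, Michelle; Kim, Jesi; Pleasure, Mitchell; Moczulski, Christopher; Lopez, L. Julian Lechuga; Zhang, Hao; Harbi, Mariam Al; Shamout, Farah E.; Major, Vincent J.; Heacock, Laura; Moy, Linda; Schnabel, Freya; Pak, Linda M.; Shen, Yiqiu; Geras, Krzysztof J. |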
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Journal Experts 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10350195/ https://www.ncbi.nlm.nih.gov/pubmed/37461545 http://dx.doi.org/10.21203/rs.3.rs-3035772/v1 |
_version_ | 1785074084037525504 |
---|---|
author | Zeng, Ken G.; Dutt, Tarun; Witowski, Jan; Kranthi Kiran, GV; Yeung, Frank; Kim, Michelle; Kim, Jesi; Pleasure, Mitchell; Moczulski, Christopher; Lopez, L. Julian Lechuga; Zhang, Hao; Harbi, Mariam Al; Shamout, Farah E.; Major, Vincent J.; Heacock, Laura; Moy, Linda; Schnabel, Freya; Pak, Linda M.; Shen, Yiqiu; Geras, Krzysztof J. |
author_facet | Zeng, Ken G.; Dutt, Tarun; Witowski, Jan; Kranthi Kiran, GV; Yeung, Frank; Kim, Michelle; Kim, Jesi; Pleasure, Mitchell; Moczulski, Christopher; Lopez, L. Julian Lechuga; Zhang, Hao; Harbi, Mariam Al; Shamout, Farah E.; Major, Vincent J.; Heacock, Laura; Moy, Linda; Schnabel, Freya; Pak, Linda M.; Shen, Yiqiu; Geras, Krzysztof J. |
author_sort | Zeng, Ken G. |
collection | PubMed |
description | Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model’s generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction. |
format | Online Article Text |
id | pubmed-10350195 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-10350195 2023-07-17 Improving Information Extraction from Pathology Reports using Named Entity Recognition Zeng, Ken G.; Dutt, Tarun; Witowski, Jan; Kranthi Kiran, GV; Yeung, Frank; Kim, Michelle; Kim, Jesi; Pleasure, Mitchell; Moczulski, Christopher; Lopez, L. Julian Lechuga; Zhang, Hao; Harbi, Mariam Al; Shamout, Farah E.; Major, Vincent J.; Heacock, Laura; Moy, Linda; Schnabel, Freya; Pak, Linda M.; Shen, Yiqiu; Geras, Krzysztof J. Res Sq Article Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model’s generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction. American Journal Experts 2023-07-03 /pmc/articles/PMC10350195/ /pubmed/37461545 http://dx.doi.org/10.21203/rs.3.rs-3035772/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Zeng, Ken G.; Dutt, Tarun; Witowski, Jan; Kranthi Kiran, GV; Yeung, Frank; Kim, Michelle; Kim, Jesi; Pleasure, Mitchell; Moczulski, Christopher; Lopez, L. Julian Lechuga; Zhang, Hao; Harbi, Mariam Al; Shamout, Farah E.; Major, Vincent J.; Heacock, Laura; Moy, Linda; Schnabel, Freya; Pak, Linda M.; Shen, Yiqiu; Geras, Krzysztof J. Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title | Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title_full | Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title_fullStr | Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title_full_unstemmed | Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title_short | Improving Information Extraction from Pathology Reports using Named Entity Recognition |
title_sort | improving information extraction from pathology reports using named entity recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10350195/ https://www.ncbi.nlm.nih.gov/pubmed/37461545 http://dx.doi.org/10.21203/rs.3.rs-3035772/v1 |
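
The description in this record reports results as an entity F1-score (0.916 internal, 0.860 external) but does not say how the score is computed. As a rough illustration only, the sketch below uses the common exact-match convention, in which a predicted entity counts as correct only if its type and both boundaries match a gold entity. The BIO tagging scheme, the function names, and the labels `DX` and `SITE` are assumptions made for this example and are not taken from the paper or its repository.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged token sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        # "B-" always opens a span; a stray "I-" is treated leniently
        # as the start of a new span (an assumption of this sketch).
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match entity F1 over a list of sequences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: DX (diagnosis) and SITE (anatomical site).
gold = [["B-DX", "I-DX", "O", "B-SITE"]]
pred = [["B-DX", "I-DX", "O", "O"]]
score = entity_f1(gold, pred)
```

In this toy case the prediction recovers one of the two gold entities and adds no spurious ones, so precision is 1, recall is 0.5, and the F1-score is 2/3. A token-level or partial-match scorer would give different numbers, which is why the convention should be checked against the published code before comparing with the figures above.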