
Improving Information Extraction from Pathology Reports using Named Entity Recognition

Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two...

Bibliographic Details
Main Authors: Zeng, Ken G., Dutt, Tarun, Witowski, Jan, Kranthi Kiran, GV, Yeung, Frank, Kim, Michelle, Kim, Jesi, Pleasure, Mitchell, Moczulski, Christopher, Lopez, L. Julian Lechuga, Zhang, Hao, Harbi, Mariam Al, Shamout, Farah E., Major, Vincent J., Heacock, Laura, Moy, Linda, Schnabel, Freya, Pak, Linda M., Shen, Yiqiu, Geras, Krzysztof J.
Format: Online Article Text
Language: English
Published: American Journal Experts 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10350195/
https://www.ncbi.nlm.nih.gov/pubmed/37461545
http://dx.doi.org/10.21203/rs.3.rs-3035772/v1
author Zeng, Ken G.
Dutt, Tarun
Witowski, Jan
Kranthi Kiran, GV
Yeung, Frank
Kim, Michelle
Kim, Jesi
Pleasure, Mitchell
Moczulski, Christopher
Lopez, L. Julian Lechuga
Zhang, Hao
Harbi, Mariam Al
Shamout, Farah E.
Major, Vincent J.
Heacock, Laura
Moy, Linda
Schnabel, Freya
Pak, Linda M.
Shen, Yiqiu
Geras, Krzysztof J.
collection PubMed
description Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model’s generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction.
format Online
Article
Text
id pubmed-10350195
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Journal Experts
record_format MEDLINE/PubMed
spelling pubmed-103501952023-07-17 Improving Information Extraction from Pathology Reports using Named Entity Recognition Zeng, Ken G. Dutt, Tarun Witowski, Jan Kranthi Kiran, GV Yeung, Frank Kim, Michelle Kim, Jesi Pleasure, Mitchell Moczulski, Christopher Lopez, L. Julian Lechuga Zhang, Hao Harbi, Mariam Al Shamout, Farah E. Major, Vincent J. Heacock, Laura Moy, Linda Schnabel, Freya Pak, Linda M. Shen, Yiqiu Geras, Krzysztof J. Res Sq Article Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model’s generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). 
Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction. American Journal Experts 2023-07-03 /pmc/articles/PMC10350195/ /pubmed/37461545 http://dx.doi.org/10.21203/rs.3.rs-3035772/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Improving Information Extraction from Pathology Reports using Named Entity Recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10350195/
https://www.ncbi.nlm.nih.gov/pubmed/37461545
http://dx.doi.org/10.21203/rs.3.rs-3035772/v1