
Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system

BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated...


Bibliographic Details
Main authors: Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V.
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/
https://www.ncbi.nlm.nih.gov/pubmed/35549854
http://dx.doi.org/10.1186/s12874-022-01583-z
author Chen, Yifu
Hao, Lucy
Zou, Vito Z.
Hollander, Zsuzsanna
Ng, Raymond T.
Isaac, Kathryn V.
collection PubMed
description BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated information extraction system is needed to ensure the timeliness and scalability of research data. METHODS: We used a dataset of 100 synoptic operative and 100 pathology reports, evenly split into 50 reports in training and test sets for each report type. The training set guided our development of a Natural Language Processing (NLP) extraction pipeline system, which accepts scanned images of operative and pathology reports. The system uses a combination of rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper was made available on GitHub. RESULTS: A test set of 50 operative and 50 pathology reports were used to evaluate the extraction accuracies of the NLP pipeline. Gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall 91.9% (operative) and 95.4% (pathology) accuracy. The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Out of the 48 variables across operative and pathology codebooks, NLP yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90. 
CONCLUSIONS: The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-022-01583-z.
format Online
Article
Text
id pubmed-9101856
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9101856 2022-05-14 Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system Chen, Yifu Hao, Lucy Zou, Vito Z. Hollander, Zsuzsanna Ng, Raymond T. Isaac, Kathryn V. BMC Med Res Methodol Research BioMed Central 2022-05-12 /pmc/articles/PMC9101856/ /pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/
https://www.ncbi.nlm.nih.gov/pubmed/35549854
http://dx.doi.org/10.1186/s12874-022-01583-z