Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system
BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated...
Main authors: | Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/ https://www.ncbi.nlm.nih.gov/pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z |
_version_ | 1784707189394374656 |
---|---|
author | Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V. |
author_sort | Chen, Yifu |
collection | PubMed |
description | BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated information extraction system is needed to ensure the timeliness and scalability of research data. METHODS: We used a dataset of 100 synoptic operative and 100 pathology reports, evenly split into 50 reports in training and test sets for each report type. The training set guided our development of a Natural Language Processing (NLP) extraction pipeline system, which accepts scanned images of operative and pathology reports. The system uses a combination of rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper was made available on GitHub. RESULTS: A test set of 50 operative and 50 pathology reports was used to evaluate the extraction accuracies of the NLP pipeline. Gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall 91.9% (operative) and 95.4% (pathology) accuracy. The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Out of the 48 variables across operative and pathology codebooks, NLP yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90. CONCLUSIONS: The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-022-01583-z. |
format | Online Article Text |
id | pubmed-9101856 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9101856 2022-05-14 BMC Med Res Methodol Research BioMed Central 2022-05-12 /pmc/articles/PMC9101856/ /pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/ https://www.ncbi.nlm.nih.gov/pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z |
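
The description above reports how many codebook variables reach an F-score of at least 0.90. As a rough illustration of that kind of per-variable comparison, here is a minimal sketch in Python. It is not the authors' code: the function names, the toy variables and values, and the exact F-score definition (a prediction counts as correct only when it exactly matches the manual value, with `None` standing for "no value") are all assumptions made for this example.

```python
from typing import Dict, List, Optional

Value = Optional[str]  # None means "no value recorded / nothing extracted"


def f_score(gold: List[Value], pred: List[Value]) -> float:
    """F1 for one variable, comparing extracted values to manual values.

    Assumed definition (not taken from the paper):
      true positive  = a value was extracted and matches the manual value
      false positive = a value was extracted but does not match
      false negative = a manual value exists but was missed or mismatched
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p is not None and p == g)
    fp = sum(1 for g, p in pairs if p is not None and p != g)
    fn = sum(1 for g, p in pairs if g is not None and p != g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def count_at_least(
    gold: Dict[str, List[Value]],
    pred: Dict[str, List[Value]],
    threshold: float = 0.90,
) -> int:
    """Number of variables whose F-score meets the threshold."""
    return sum(1 for var in gold if f_score(gold[var], pred[var]) >= threshold)


# Hypothetical toy data: two variables over four reports.
gold = {
    "histologic_type": ["ductal", "lobular", "ductal", None],
    "margins": ["negative", "positive", "negative", "negative"],
}
pred = {
    "histologic_type": ["ductal", "lobular", "ductal", None],
    "margins": ["negative", "negative", "negative", None],
}

if __name__ == "__main__":
    for var in gold:
        print(var, round(f_score(gold[var], pred[var]), 3))
    print("variables with F >= 0.90:", count_at_least(gold, pred))
```

In this toy data one of the two variables is extracted perfectly and the other is not, so only one variable clears the 0.90 threshold. The same count could be produced for a manual annotator's output to get a human baseline to compare against.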