
Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system

BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated...


Bibliographic Details
Main Authors:	Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/ https://www.ncbi.nlm.nih.gov/pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z

_version_	1784707189394374656
author	Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V.
author_sort	Chen, Yifu
collection	PubMed
description	BACKGROUND: Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated information extraction system is needed to ensure the timeliness and scalability of research data. METHODS: We used a dataset of 100 synoptic operative and 100 pathology reports, evenly split into 50 reports in training and test sets for each report type. The training set guided our development of a Natural Language Processing (NLP) extraction pipeline system, which accepts scanned images of operative and pathology reports. The system uses a combination of rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper was made available on GitHub. RESULTS: A test set of 50 operative and 50 pathology reports were used to evaluate the extraction accuracies of the NLP pipeline. Gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall 91.9% (operative) and 95.4% (pathology) accuracy. The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Out of the 48 variables across operative and pathology codebooks, NLP yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90. 
CONCLUSIONS: The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-022-01583-z.
format	Online Article Text
id	pubmed-9101856
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9101856 2022-05-14 Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system Chen, Yifu; Hao, Lucy; Zou, Vito Z.; Hollander, Zsuzsanna; Ng, Raymond T.; Isaac, Kathryn V. BMC Med Res Methodol Research BioMed Central 2022-05-12 /pmc/articles/PMC9101856/ /pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9101856/ https://www.ncbi.nlm.nih.gov/pubmed/35549854 http://dx.doi.org/10.1186/s12874-022-01583-z

