Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports

As part of its core business of gathering population-based information on new cancer diagnoses, the Belgian Cancer Registry receives free-text pathology reports, describing results of (pre-)malignant specimens. These reports are provided by 82 laboratories and written in 2 national languages, Dutch...

Bibliographic Details
Main Authors: Pironet, Antoine, Poirel, Hélène A., Tambuyzer, Tim, De Schutter, Harlinde, van Walle, Lien, Mattheijssens, Joris, Henau, Kris, Van Eycken, Liesbet, Van Damme, Nancy
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Digital Health
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8522027/
https://www.ncbi.nlm.nih.gov/pubmed/34713168
http://dx.doi.org/10.3389/fdgth.2021.692077
author Pironet, Antoine
Poirel, Hélène A.
Tambuyzer, Tim
De Schutter, Harlinde
van Walle, Lien
Mattheijssens, Joris
Henau, Kris
Van Eycken, Liesbet
Van Damme, Nancy
collection PubMed
description As part of its core business of gathering population-based information on new cancer diagnoses, the Belgian Cancer Registry receives free-text pathology reports describing the results of (pre-)malignant specimens. These reports are provided by 82 laboratories and are written in one of two national languages, Dutch or French. For breast cancer, the reports characterize the status of the estrogen receptor, the progesterone receptor, and Erb-b2 receptor tyrosine kinase 2. These biomarkers are related to tumor growth and prognosis and are essential for defining therapeutic management. Population-scale information about their status in breast cancer patients can therefore be considered crucial to enrich real-world scientific studies and to guide public health policies regarding personalized medicine. The main objective of this study is to expand the data available at the Belgian Cancer Registry by automatically extracting the status of these biomarkers from the pathology reports. Various types of numeric features are computed from over 1,300 manually annotated reports linked to breast tumors diagnosed in 2014. A range of popular machine learning classifiers, such as support vector machines, random forests, and logistic regression, are trained on these data and compared using their F1 scores on a separate validation set. On a held-out test set, the best-performing classifiers achieve F1 scores ranging from 0.89 to 0.92 for the four classification tasks. The extraction is thus reliable and makes it possible to significantly increase the availability of this valuable information on breast cancer receptor status at a population level.
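The model-comparison step in the description (several classifiers compared by their F1 scores on a validation set) can be illustrated with a minimal sketch. It is not the authors' code: the labels, the model names, and the helper functions `f1_score` and `select_best` are hypothetical, the predictions are made up, and the F1 shown is the plain single-class variant, which may differ from the averaging used in the study.

```python
def f1_score(y_true, y_pred, positive="positive"):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(y_val, predictions_by_model):
    """Return (model name, F1) for the model with the highest validation F1."""
    scores = {
        name: f1_score(y_val, preds)
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up validation labels and predictions, for illustration only.
y_val = ["positive", "positive", "negative", "positive", "negative"]
candidates = {
    "svm": ["positive", "positive", "negative", "negative", "negative"],
    "random_forest": ["positive", "positive", "positive", "positive", "negative"],
}

best_name, best_f1 = select_best(y_val, candidates)
print(best_name, round(best_f1, 3))
```

In a setup like the one described, the selected model would then be scored once more on the held-out test set, so that the reported figure is not biased by the selection step.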
format Online
Article
Text
id pubmed-8522027
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8522027 2021-10-27 Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports Front Digit Health Digital Health Frontiers Media S.A. 2021-08-17 /pmc/articles/PMC8522027/ /pubmed/34713168 http://dx.doi.org/10.3389/fdgth.2021.692077 Text en
Copyright © 2021 Pironet, Poirel, Tambuyzer, De Schutter, van Walle, Mattheijssens, Henau, Van Eycken and Van Damme. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports
topic Digital Health