
Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records

SIMPLE SUMMARY: Breast cancer is a heterogeneous disease characterized by different risks of relapse, which makes it challenging to predict progression and select the most appropriate follow-up strategies. With the ever-growing adoption of Electronic Health Records, there are great opportunities to...


Bibliographic Details
Main authors: González-Castro, Lorena, Chávez, Marcela, Duflot, Patrick, Bleret, Valérie, Martin, Alistair G., Zobel, Marc, Nateqi, Jama, Lin, Simon, Pazos-Arias, José J., Del Fiol, Guilherme, López-Nores, Martín
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10216131/
https://www.ncbi.nlm.nih.gov/pubmed/37345078
http://dx.doi.org/10.3390/cancers15102741
author González-Castro, Lorena
Chávez, Marcela
Duflot, Patrick
Bleret, Valérie
Martin, Alistair G.
Zobel, Marc
Nateqi, Jama
Lin, Simon
Pazos-Arias, José J.
Del Fiol, Guilherme
López-Nores, Martín
collection PubMed
description SIMPLE SUMMARY: Breast cancer is a heterogeneous disease characterized by different risks of relapse, which makes it challenging to predict progression and select the most appropriate follow-up strategies. With the ever-growing adoption of Electronic Health Records, there are great opportunities to leverage the large amount of data routinely collected in electronic format for secondary purposes. Machine Learning algorithms offer the ability to analyze large amounts of data and reveal insights that might otherwise go undetected. In this study, we have applied several algorithms to predict 5-year breast cancer recurrence from health data. We compared whether taking advantage of both structured and unstructured data from health records yields better prediction results than using either source separately. These algorithms are valuable tools to help clinicians effectively integrate large amounts of data into their decision-making and are key to improving risk stratification and providing personalized assistance to patients.
ABSTRACT: Recurrence is a critical aspect of breast cancer (BC) that is inexorably tied to mortality. Reuse of healthcare data through Machine Learning (ML) algorithms offers great opportunities to improve the stratification of patients at risk of cancer recurrence. We hypothesized that combining features from structured and unstructured sources would provide better prediction results for 5-year cancer recurrence than either source alone. We collected and preprocessed clinical data from a cohort of BC patients, resulting in 823 valid subjects for analysis. We derived three sets of features: structured information, features from free text, and a combination of both. We evaluated the performance of five ML algorithms to predict 5-year cancer recurrence and selected the best-performing one to test our hypothesis. The XGB (eXtreme Gradient Boosting) model yielded the best performance among the five evaluated algorithms, with precision = 0.900, recall = 0.907, F1-score = 0.897, and area under the receiver operating characteristic curve (AUROC) = 0.807. The best prediction results were achieved with the structured dataset, followed by the unstructured dataset, while the combined dataset achieved the poorest performance. ML algorithms for BC recurrence prediction are valuable tools to improve patient risk stratification, help with post-cancer monitoring, and plan more effective follow-up. Structured data provides the best results when fed to ML algorithms. However, an approach based on natural language processing offers comparable results while potentially requiring less mapping effort.
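The precision, recall, and F1-score reported in the abstract can be illustrated with a minimal sketch in plain Python. The labels below are hypothetical and are not taken from the study. The function names `per_class_metrics` and `weighted_metrics` are illustrative, and the support-weighted averaging is an assumption, since the abstract does not state how the scores were averaged across classes.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_metrics(y_true, y_pred):
    """Average the per-class metrics, weighting each class by its support.

    Assumption: support-weighted averaging; the abstract does not say
    which averaging scheme the authors used.
    """
    support = Counter(y_true)
    total = len(y_true)
    sums = [0.0, 0.0, 0.0]
    for label, count in support.items():
        for i, value in enumerate(per_class_metrics(y_true, y_pred, label)):
            sums[i] += value * count / total
    return tuple(sums)


# Hypothetical labels: 1 = recurrence within 5 years, 0 = no recurrence.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(per_class_metrics(y_true, y_pred, positive=1))
print(weighted_metrics(y_true, y_pred))
```

When scores are averaged per class in this way, the averaged F1 is not guaranteed to lie between the averaged precision and the averaged recall, because it is a mean of per-class F1 values rather than the harmonic mean of the two averages.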
format Online
Article
Text
id pubmed-10216131
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10216131 2023-05-27 Cancers (Basel) Article MDPI 2023-05-13 /pmc/articles/PMC10216131/ /pubmed/37345078 http://dx.doi.org/10.3390/cancers15102741 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records
topic Article