Automated data preparation for in vivo tumor characterization with machine learning

BACKGROUND: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts. METHODS: A collection of well-established DP methods was incorporated to build DP pipelines for various clinical cohorts...

Bibliographic Details
Main authors: Krajnc, Denis, Spielvogel, Clemens P., Grahovac, Marko, Ecsedi, Boglarka, Rasul, Sazan, Poetsch, Nina, Traub-Weidinger, Tatjana, Haug, Alexander R., Ritter, Zsombor, Alizadeh, Hussain, Hacker, Marcus, Beyer, Thomas, Papp, Laszlo
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9595446/
https://www.ncbi.nlm.nih.gov/pubmed/36303841
http://dx.doi.org/10.3389/fonc.2022.1017911
author Krajnc, Denis
Spielvogel, Clemens P.
Grahovac, Marko
Ecsedi, Boglarka
Rasul, Sazan
Poetsch, Nina
Traub-Weidinger, Tatjana
Haug, Alexander R.
Ritter, Zsombor
Alizadeh, Hussain
Hacker, Marcus
Beyer, Thomas
Papp, Laszlo
collection PubMed
description BACKGROUND: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.
METHODS: A collection of well-established DP methods was incorporated to build DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best-fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single-center cohorts by a 100-fold Monte Carlo (MC) cross-validation scheme with an 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized, with Center 1 as the training dataset and Center 2 as the independent validation dataset, to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with MLDP, without MLDP, and with manually defined DP was compared in each of the four cohorts.
RESULTS: Sixteen of twenty established predictive models demonstrated an increase in area under the receiver operating characteristic curve (AUC) when utilizing MLDP. MLDP resulted in the highest performance increase for the random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-month survival in the glioma cohort. Single-center cohorts resulted in complex DP pipelines (6-7 DP steps), with a high occurrence of outlier detection, feature selection and synthetic minority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort included only the outlier detection and SMOTE steps.
CONCLUSIONS: This study demonstrates that data preparation prior to ML prediction model building in cancer cohorts should itself be ML-driven, yielding optimal prediction models in both single- and multi-centric settings.
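The validation scheme named in the abstract (100-fold Monte Carlo cross-validation with an 80-20% training-validation split, scored by AUC) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' implementation: the function names `monte_carlo_splits` and `auc`, the fixed seed, and the rank-based AUC formula are all choices made here for illustration, and the record does not describe how the study computed either step.

```python
import random


def monte_carlo_splits(n_samples, n_folds=100, train_fraction=0.8, seed=0):
    """Yield (train_idx, valid_idx) pairs from repeated random splits.

    Hypothetical helper: each fold reshuffles all sample indices and
    cuts them at train_fraction, so folds may overlap with each other.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n_train = int(round(n_samples * train_fraction))
    indices = list(range(n_samples))
    for _ in range(n_folds):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, a model would be fitted on each `train_idx`, scored on the matching `valid_idx`, and the per-fold AUC values averaged; any data preparation step would be fitted on the training indices only, so that no information from the validation samples leaks into the pipeline.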
format Online
Article
Text
id pubmed-9595446
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9595446 2022-10-26 Automated data preparation for in vivo tumor characterization with machine learning Front Oncol Oncology Frontiers Media S.A. 2022-10-11 /pmc/articles/PMC9595446/ /pubmed/36303841 http://dx.doi.org/10.3389/fonc.2022.1017911 Text en
Copyright © 2022 Krajnc, Spielvogel, Grahovac, Ecsedi, Rasul, Poetsch, Traub-Weidinger, Haug, Ritter, Alizadeh, Hacker, Beyer and Papp
https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Automated data preparation for in vivo tumor characterization with machine learning
topic Oncology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9595446/
https://www.ncbi.nlm.nih.gov/pubmed/36303841
http://dx.doi.org/10.3389/fonc.2022.1017911