
Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets


Bibliographic Details
Main Authors:	Sharafutdinov, Konstantin; Bhat, Jayesh S.; Fritsch, Sebastian Johannes; Nikulina, Kateryna; E. Samadi, Moein; Polzin, Richard; Mayer, Hannah; Marx, Gernot; Bickenbach, Johannes; Schuppert, Andreas
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Big Data
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9659720/ https://www.ncbi.nlm.nih.gov/pubmed/36387013 http://dx.doi.org/10.3389/fdata.2022.603429

author	Sharafutdinov, Konstantin; Bhat, Jayesh S.; Fritsch, Sebastian Johannes; Nikulina, Kateryna; E. Samadi, Moein; Polzin, Richard; Mayer, Hannah; Marx, Gernot; Bickenbach, Johannes; Schuppert, Andreas
collection	PubMed
description	Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail for unseen data, then the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals, such that sampling representative learning datasets to train ML models remains a challenge. As ML models exhibit poor predictive performance over data ranges sparsely or not covered by the learning dataset, in this study we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find the mean CH coverage between each pair of datasets, resulting in an upper bound of the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital) and to estimate differences in datasets with respect to underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another one. We show that the strongest drop in performance was associated with poor intersection of the convex hulls of the corresponding hospitals' datasets and with high performance of ML methods for dataset discrimination. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models. We emphasize that datasets from different hospitals represent heterogeneous data sources, and the transfer from one database to another should be performed with utmost care to avoid adverse implications during real-world applications of the developed models. Further research is needed to develop methods for the adaptation of ML models to new hospitals. In addition, more work should be aimed at the creation of gold-standard datasets that are large and diverse, with data from varied application sites.
format	Online Article Text
id	pubmed-9659720
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9659720 2022-11-15 Front Big Data Frontiers Media S.A. 2022-10-31 /pmc/articles/PMC9659720/ /pubmed/36387013 http://dx.doi.org/10.3389/fdata.2022.603429 Text en Copyright © 2022 Sharafutdinov, Bhat, Fritsch, Nikulina, E. Samadi, Polzin, Mayer, Marx, Bickenbach and Schuppert. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets
topic	Big Data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9659720/ https://www.ncbi.nlm.nih.gov/pubmed/36387013 http://dx.doi.org/10.3389/fdata.2022.603429
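
The first step named in the description field of this record (mean convex hull coverage between two datasets) can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not say how the hulls were computed, so this sketch assumes that coverage is averaged over all two-feature projections, and the names `convex_hull_2d`, `inside_hull`, `mean_ch_coverage`, `hospital_a`, and `hospital_b` are hypothetical, as is the toy data. The second step (training classifiers to identify the hospital of origin) is not covered here.

```python
from itertools import combinations


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p is inside or on the boundary of a counter-clockwise convex polygon."""
    if len(hull) < 3:
        return False  # assumption: a degenerate hull (point or segment) covers nothing
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def mean_ch_coverage(reference, target):
    """Mean fraction of `target` rows inside the hull of `reference`,
    averaged over all two-feature projections (an assumption of this sketch)."""
    n_features = len(reference[0])
    fractions = []
    for i, j in combinations(range(n_features), 2):
        hull = convex_hull_2d([(row[i], row[j]) for row in reference])
        hits = sum(inside_hull(hull, (row[i], row[j])) for row in target)
        fractions.append(hits / len(target))
    return sum(fractions) / len(fractions)


# Toy data (hypothetical): three features per patient, two "hospitals".
hospital_a = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
hospital_b = [(0.5, 0.5, 0.5), (2, 2, 2)]

print(mean_ch_coverage(hospital_a, hospital_b))
```

Coverage computed this way is asymmetric: the share of hospital B's patients inside hospital A's hull generally differs from the reverse, so both directions would need to be reported when comparing two sites.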
