
Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies

Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning al...


Bibliographic Details
Main authors:	Gao, Yilin; Sun, Fengzhu
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10602384/ https://www.ncbi.nlm.nih.gov/pubmed/37844077 http://dx.doi.org/10.1371/journal.pcbi.1010608

_version_	1785126391095754752
author	Gao, Yilin Sun, Fengzhu
collection	PubMed
description	Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly negatively impact the machine learning classifier’s reproducibility. ComBat normalization improved the prediction performance of machine learning classifier when heterogeneous populations are present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier’s prediction accuracy can be markedly decreased as the underlying disease model became more different in training and test populations. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. 
We also showed that both merging strategy and integration methods can achieve good performances when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods had similar performance as other ensemble learning approaches.
format	Online Article Text
id	pubmed-10602384
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10602384 2023-10-27 Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies Gao, Yilin Sun, Fengzhu PLoS Comput Biol Research Article Public Library of Science 2023-10-16 /pmc/articles/PMC10602384/ /pubmed/37844077 http://dx.doi.org/10.1371/journal.pcbi.1010608 Text en © 2023 Gao, Sun https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10602384/ https://www.ncbi.nlm.nih.gov/pubmed/37844077 http://dx.doi.org/10.1371/journal.pcbi.1010608
