
A pairwise strategy for imputing predictive features when combining multiple datasets

MOTIVATION: In the training of predictive models using high-dimensional genomic data, multiple studies’ worth of data are often combined to increase sample size and improve generalizability. A drawback of this approach is that there may be different sets of features measured in each study due to var...


Bibliographic Details
Main Authors:	Wu, Yujie; Ren, Boyu; Patil, Prasad
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9835467/ https://www.ncbi.nlm.nih.gov/pubmed/36576001 http://dx.doi.org/10.1093/bioinformatics/btac839

author	Wu, Yujie Ren, Boyu Patil, Prasad
author_facet	Wu, Yujie Ren, Boyu Patil, Prasad
author_sort	Wu, Yujie
collection	PubMed
description	MOTIVATION: In the training of predictive models using high-dimensional genomic data, multiple studies’ worth of data are often combined to increase sample size and improve generalizability. A drawback of this approach is that there may be different sets of features measured in each study due to variations in expression measurement platform or technology. It is often common practice to work only with the intersection of features measured in common across all studies, which results in the blind discarding of potentially useful feature information that is measured in individual or subsets of studies. RESULTS: We characterize the loss in predictive performance incurred by using only the intersection of feature information available across all studies when training predictors using gene expression data from microarray and sequencing datasets. We study the properties of linear and polynomial regression for imputing discarded features and demonstrate improvements in the external performance of prediction functions through simulation and in gene expression data collected on breast cancer patients. To improve this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time and averages imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first merging all datasets together and imputing any resulting missing features. Finally, we provide insights on which subsets of intersected and study-specific features should be used so that missing-feature imputation best promotes cross-study replicability. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/YujieWuu/Pairwise_imputation. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
format	Online Article Text
id	pubmed-9835467
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9835467 2023-01-17 A pairwise strategy for imputing predictive features when combining multiple datasets Wu, Yujie; Ren, Boyu; Patil, Prasad Bioinformatics Original Paper Oxford University Press 2022-12-28 /pmc/articles/PMC9835467/ /pubmed/36576001 http://dx.doi.org/10.1093/bioinformatics/btac839 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A pairwise strategy for imputing predictive features when combining multiple datasets
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9835467/ https://www.ncbi.nlm.nih.gov/pubmed/36576001 http://dx.doi.org/10.1093/bioinformatics/btac839
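
The abstract describes the pairwise strategy only in outline: apply an imputation algorithm to two studies at a time, then average the imputed features across pairs. The sketch below is one possible reading of that outline, not the authors' implementation (their code is at the GitHub address given in the abstract). It assumes linear regression as the imputation algorithm, and the function names `fit_linear`, `predict_linear`, and `pairwise_impute`, the dict-of-arrays study layout, and the synthetic gene names are all hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit of y on X with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a new feature matrix."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def pairwise_impute(target, others, feature):
    """Impute `feature` in `target` using one other study at a time.

    Each study is a dict mapping a feature name to a 1-D array with one
    value per sample (an assumed layout). For every study that measures
    `feature`, a regression on the features it shares with `target` is
    fitted there and applied to `target`; the per-pair predictions are
    then averaged.
    """
    estimates = []
    for study in others:
        if feature not in study:
            continue  # this study cannot inform the missing feature
        shared = sorted((set(target) & set(study)) - {feature})
        if not shared:
            continue  # no common predictors for this pair
        X_src = np.column_stack([study[f] for f in shared])
        X_tgt = np.column_stack([target[f] for f in shared])
        coef = fit_linear(X_src, study[feature])
        estimates.append(predict_linear(coef, X_tgt))
    if not estimates:
        raise ValueError(f"no study pair can impute {feature!r}")
    return np.mean(estimates, axis=0)


# Synthetic, noiseless demo: g3 is an exact linear function of g1 and g2.
rng = np.random.default_rng(0)


def make_study(n, extra=False):
    g1, g2 = rng.normal(size=n), rng.normal(size=n)
    study = {"g1": g1, "g2": g2, "g3": 2 * g1 - g2 + 1}
    if extra:
        study["g4"] = rng.normal(size=n)  # measured only in this study
    return study


study_a = make_study(50)
study_b = make_study(60, extra=True)
target = {"g1": rng.normal(size=20), "g2": rng.normal(size=20)}  # g3 missing

imputed = pairwise_impute(target, [study_a, study_b], "g3")
```

Because each pair uses only the features those two studies share, a study with an unusual feature set does not shrink the predictor set available to the other pairs, which is the intuition this sketch is meant to convey. Any other imputation algorithm could be substituted for the two linear helpers.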


Similar Items