
Transcriptome prediction performance across machine learning models and diverse ancestries

Transcriptome prediction methods such as PrediXcan and FUSION have become popular in complex trait mapping. Most transcriptome prediction models have been trained in European populations using methods that make parametric linear assumptions like the elastic net (EN). To potentially further optimize...


Bibliographic Details
Main authors: Okoro, Paul C., Schubert, Ryan, Guo, Xiuqing, Johnson, W. Craig, Rotter, Jerome I., Hoeschele, Ina, Liu, Yongmei, Im, Hae Kyung, Luke, Amy, Dugas, Lara R., Wheeler, Heather E.
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087249/
https://www.ncbi.nlm.nih.gov/pubmed/33937878
http://dx.doi.org/10.1016/j.xhgg.2020.100019
author Okoro, Paul C.
Schubert, Ryan
Guo, Xiuqing
Johnson, W. Craig
Rotter, Jerome I.
Hoeschele, Ina
Liu, Yongmei
Im, Hae Kyung
Luke, Amy
Dugas, Lara R.
Wheeler, Heather E.
author_sort Okoro, Paul C.
collection PubMed
description Transcriptome prediction methods such as PrediXcan and FUSION have become popular in complex trait mapping. Most transcriptome prediction models have been trained in European populations using methods that make parametric linear assumptions like the elastic net (EN). To potentially further optimize imputation performance of gene expression across global populations, we built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and evaluated their performance in comparison to EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA) comprising individuals of African, Hispanic, and European ancestries and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study (METS) comprising individuals of African ancestries. We show that the prediction performance is highest when the training and the testing population share similar ancestries regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN), we found that RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, we show including RF prediction models in PrediXcan revealed potential gene associations missed by EN models. Therefore, by integrating other ML modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.
format Online
Article
Text
id pubmed-8087249
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8087249 2021-04-30 Transcriptome prediction performance across machine learning models and diverse ancestries Okoro, Paul C. Schubert, Ryan Guo, Xiuqing Johnson, W. Craig Rotter, Jerome I. Hoeschele, Ina Liu, Yongmei Im, Hae Kyung Luke, Amy Dugas, Lara R. Wheeler, Heather E. HGG Adv Article Elsevier 2021-01-05 /pmc/articles/PMC8087249/ /pubmed/33937878 http://dx.doi.org/10.1016/j.xhgg.2020.100019 Text en © 2020 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Transcriptome prediction performance across machine learning models and diverse ancestries
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087249/
https://www.ncbi.nlm.nih.gov/pubmed/33937878
http://dx.doi.org/10.1016/j.xhgg.2020.100019