Combinatorial and statistical prediction of gene expression from haplotype sequence

MOTIVATION: Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these vari...

Bibliographic Details
Main authors: Alpay, Berk A; Demetci, Pinar; Istrail, Sorin; Aguiar, Derek
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355230/
https://www.ncbi.nlm.nih.gov/pubmed/32657373
http://dx.doi.org/10.1093/bioinformatics/btaa318
_version_ 1783558232472551424
author Alpay, Berk A
Demetci, Pinar
Istrail, Sorin
Aguiar, Derek
author_sort Alpay, Berk A
collection PubMed
description MOTIVATION: Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation.
RESULTS: In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm, which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm, which models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes ([Formula: see text]) among all methods. We show that the variant and haplotype feature sets selected by HAPLEXR are smaller than those selected by competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage-dependent and intragenic epistatic effects when predicting expression.
AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available at https://github.com/rapturous/HAPLEX.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7355230
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7355230 2020-07-16 Bioinformatics Genomic Variation Analysis Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355230/ /pubmed/32657373 http://dx.doi.org/10.1093/bioinformatics/btaa318 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Combinatorial and statistical prediction of gene expression from haplotype sequence
topic Genomic Variation Analysis