Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease

A central challenge in human genomics is to understand the cellular, evolutionary, and clinical significance of genetic variants. Here, we introduce a unified population-genetic and machine-learning model, called Linear Allele-Specific Selection InferencE (LASSIE), for estimating the fitness effects...

Bibliographic Details
Main Authors: Huang, Yi-Fei, Siepel, Adam
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6673719/
https://www.ncbi.nlm.nih.gov/pubmed/31249063
http://dx.doi.org/10.1101/gr.245522.118
_version_ 1783440596133740544
author Huang, Yi-Fei
Siepel, Adam
author_facet Huang, Yi-Fei
Siepel, Adam
author_sort Huang, Yi-Fei
collection PubMed
description A central challenge in human genomics is to understand the cellular, evolutionary, and clinical significance of genetic variants. Here, we introduce a unified population-genetic and machine-learning model, called Linear Allele-Specific Selection InferencE (LASSIE), for estimating the fitness effects of all observed and potential single-nucleotide variants, based on polymorphism data and predictive genomic features. We applied LASSIE to 51 high-coverage genome sequences annotated with 33 genomic features and constructed a map of allele-specific selection coefficients across all protein-coding sequences in the human genome. This map is generally consistent with previous inferences of the bulk distribution of fitness effects but reveals pervasive weak negative selection against synonymous mutations. In addition, the estimated selection coefficients are highly predictive of inherited pathogenic variants and cancer driver mutations, outperforming state-of-the-art variant prioritization methods. By contrasting our estimated model with ultrahigh coverage ExAC exome-sequencing data, we identified 1118 genes under unusually strong negative selection, which tend to be exclusively expressed in the central nervous system or associated with autism spectrum disorder, as well as 773 genes under unusually weak selection, which tend to be associated with metabolism. This combination of classical population genetic theory with modern machine-learning and large-scale genomic data is a powerful paradigm for the study of both human evolution and disease.
format Online
Article
Text
id pubmed-6673719
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-6673719 2020-02-01 Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease Huang, Yi-Fei Siepel, Adam Genome Res Method
Cold Spring Harbor Laboratory Press 2019-08 /pmc/articles/PMC6673719/ /pubmed/31249063 http://dx.doi.org/10.1101/gr.245522.118 Text en © 2019 Huang and Siepel; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Method
Huang, Yi-Fei
Siepel, Adam
Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease
title Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease
title_sort estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6673719/
https://www.ncbi.nlm.nih.gov/pubmed/31249063
http://dx.doi.org/10.1101/gr.245522.118
work_keys_str_mv AT huangyifei estimationofallelespecificfitnesseffectsacrosshumanproteincodingsequencesandimplicationsfordisease
AT siepeladam estimationofallelespecificfitnesseffectsacrosshumanproteincodingsequencesandimplicationsfordisease