
A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny

BACKGROUND: Widely used substitution models for proteins, such as the Jones-Taylor-Thornton (JTT) or Whelan and Goldman (WAG) models, are based on empirical amino acid interchange matrices estimated from databases of protein alignments that incorporate the average amino acid frequencies of the data...


Bibliographic Details
Main Authors:	Wang, Huai-Chun; Li, Karen; Susko, Edward; Roger, Andrew J
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2628903/ https://www.ncbi.nlm.nih.gov/pubmed/19087270 http://dx.doi.org/10.1186/1471-2148-8-331

author	Wang, Huai-Chun; Li, Karen; Susko, Edward; Roger, Andrew J
author_sort	Wang, Huai-Chun
collection	PubMed
description	BACKGROUND: Widely used substitution models for proteins, such as the Jones-Taylor-Thornton (JTT) or Whelan and Goldman (WAG) models, are based on empirical amino acid interchange matrices estimated from databases of protein alignments that incorporate the average amino acid frequencies of the data set under examination (e.g., JTT + F). Variation in the evolutionary process between sites is typically modelled by a rates-across-sites distribution such as the gamma (Γ) distribution. However, sites in proteins also vary in the kinds of amino acid interchanges that are favoured, a feature that is ignored by standard empirical substitution matrices. Here we examine the degree to which the pattern of evolution at sites differs from that expected based on empirical amino acid substitution models and evaluate the impact of these deviations on phylogenetic estimation. RESULTS: We analyzed 21 large protein alignments with two statistical tests designed to detect deviation of site-specific amino acid distributions from data simulated under the standard empirical substitution model: JTT + F + Γ. We found that the number of states at a given site is, on average, smaller and the frequencies of these states are less uniform than expected based on a JTT + F + Γ substitution model. With a four-taxon example, we show that phylogenetic estimation under the JTT + F + Γ model is seriously biased by a long-branch attraction artefact if the data are simulated under a model utilizing the observed site-specific amino acid frequencies from an alignment. Principal components analyses indicate the existence of at least four major site-specific frequency classes in these 21 protein alignments. Using a mixture model with these four separate classes of site-specific state frequencies plus a fifth class of global frequencies (the JTT + cF + Γ model), significant improvements in model fit for real data sets can be achieved.
This simple mixture model also reduces the long-branch attraction problem, as shown by simulations and analyses of a real phylogenomic data set. CONCLUSION: Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models. Accurate estimation of protein phylogenies requires models that accommodate the heterogeneity in the evolutionary process across sites. To this end, we have implemented a class frequency mixture model (cF) in a freely available program called QmmRAxML for phylogenetic estimation.
format	Text
id	pubmed-2628903
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2628903 2009-01-21 Wang, Huai-Chun; Li, Karen; Susko, Edward; Roger, Andrew J BMC Evol Biol Methodology Article BioMed Central 2008-12-16 /pmc/articles/PMC2628903/ /pubmed/19087270 http://dx.doi.org/10.1186/1471-2148-8-331 Text en Copyright ©2008 Wang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny
topic	Methodology Article
