The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets

A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests is here applied and developed to evaluate and compare the adequacy of substitution, concatenation...

Bibliographic Details
Main Authors: Jiang, Xiaodong; Edwards, Scott V; Liu, Liang
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Regular Articles
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302055/
https://www.ncbi.nlm.nih.gov/pubmed/32011711
http://dx.doi.org/10.1093/sysbio/syaa008
author Jiang, Xiaodong
Edwards, Scott V
Liu, Liang
collection PubMed
description A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests is here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.]
format Online
Article
Text
id pubmed-7302055
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7302055 2020-06-23 The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets Jiang, Xiaodong Edwards, Scott V Liu, Liang Syst Biol Regular Articles Oxford University Press 2020-07 2020-02-03 /pmc/articles/PMC7302055/ /pubmed/32011711 http://dx.doi.org/10.1093/sysbio/syaa008 Text en © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets
topic Regular Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302055/
https://www.ncbi.nlm.nih.gov/pubmed/32011711
http://dx.doi.org/10.1093/sysbio/syaa008