The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets
A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation...
Main authors: | Jiang, Xiaodong; Edwards, Scott V; Liu, Liang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302055/ https://www.ncbi.nlm.nih.gov/pubmed/32011711 http://dx.doi.org/10.1093/sysbio/syaa008 |
_version_ | 1783547786278469632 |
---|---|
author | Jiang, Xiaodong Edwards, Scott V Liu, Liang |
author_facet | Jiang, Xiaodong Edwards, Scott V Liu, Liang |
author_sort | Jiang, Xiaodong |
collection | PubMed |
description | A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.] |
format | Online Article Text |
id | pubmed-7302055 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-73020552020-06-23 The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets Jiang, Xiaodong Edwards, Scott V Liu, Liang Syst Biol Regular Articles A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.] Oxford University Press 2020-07 2020-02-03 /pmc/articles/PMC7302055/ /pubmed/32011711 http://dx.doi.org/10.1093/sysbio/syaa008 Text en © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Regular Articles Jiang, Xiaodong Edwards, Scott V Liu, Liang The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title | The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title_full | The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title_fullStr | The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title_full_unstemmed | The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title_short | The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets |
title_sort | multispecies coalescent model outperforms concatenation across diverse phylogenomic data sets |
topic | Regular Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302055/ https://www.ncbi.nlm.nih.gov/pubmed/32011711 http://dx.doi.org/10.1093/sysbio/syaa008 |
work_keys_str_mv | AT jiangxiaodong themultispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets AT edwardsscottv themultispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets AT liuliang themultispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets AT jiangxiaodong multispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets AT edwardsscottv multispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets AT liuliang multispeciescoalescentmodeloutperformsconcatenationacrossdiversephylogenomicdatasets |