Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses

Bibliographic Details
Main Authors: Bayzid, Md Shamsuzzoha; Mirarab, Siavash; Boussau, Bastien; Warnow, Tandy
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4472720/
https://www.ncbi.nlm.nih.gov/pubmed/26086579
http://dx.doi.org/10.1371/journal.pone.0129183
collection PubMed
description Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. 
Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.
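The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes that the bins and one re-calculated tree per bin are already available, and that weighting is realized by repeating each bin's tree once per gene in that bin before the trees are handed to a summary method. The function name `weight_by_bin_size`, the gene labels, and the toy Newick strings are hypothetical and are not taken from the authors' software.

```python
from collections import Counter


def weight_by_bin_size(bins, bin_trees):
    """Return a list with one tree per original gene.

    bins      -- list of bins, each a list of gene identifiers
    bin_trees -- list of re-calculated trees (Newick strings), one per bin

    Each bin's tree is repeated len(bin) times, so a larger bin
    contributes proportionally more to the downstream summary method.
    """
    if len(bins) != len(bin_trees):
        raise ValueError("need exactly one re-calculated tree per bin")
    weighted = []
    for genes, tree in zip(bins, bin_trees):
        weighted.extend([tree] * len(genes))
    return weighted


# Hypothetical toy input: five genes grouped into two bins.
bins = [["g1", "g2", "g3"], ["g4", "g5"]]
bin_trees = ["((A,B),(C,D));", "((A,C),(B,D));"]

weighted = weight_by_bin_size(bins, bin_trees)
counts = Counter(weighted)
for tree, n in counts.items():
    print(n, tree)
```

In the unweighted variant each bin would contribute a single tree regardless of its size; here the first bin's tree appears three times and the second bin's tree twice, so the list passed on has as many trees as there were genes.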
id pubmed-4472720
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-4472720 2015-06-29
PLoS One
Research Article
Public Library of Science 2015-06-18
/pmc/articles/PMC4472720/ /pubmed/26086579 http://dx.doi.org/10.1371/journal.pone.0129183
Text en
© 2015 Bayzid et al
http://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
topic Research Article