A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation
BACKGROUND: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and n...
Main Authors: | Hejase, Hussein A., Liu, Kevin J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5064893/ https://www.ncbi.nlm.nih.gov/pubmed/27737628 http://dx.doi.org/10.1186/s12859-016-1277-1 |
_version_ | 1782460239458598912 |
---|---|
author | Hejase, Hussein A. Liu, Kevin J. |
author_facet | Hejase, Hussein A. Liu, Kevin J. |
author_sort | Hejase, Hussein A. |
collection | PubMed |
description | BACKGROUND: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown. RESULTS: In this study, we quantify the performance of state-of-the-art phylogenetic network inference methods on large-scale datasets using empirical data sampled from natural mouse populations and a range of simulations using model phylogenies with a single reticulation. We find that, as in the case of phylogenetic tree inference, the performance of leading network inference methods is negatively impacted by both dimensions of dataset scale. In general, we found that topological accuracy degrades as the number of taxa increases; a similar effect was observed with increased sequence mutation rate. The most accurate methods were probabilistic inference methods which maximize either likelihood under coalescent-based models or pseudo-likelihood approximations to the model likelihood. The improved accuracy obtained with probabilistic inference methods comes at a computational cost in terms of runtime and main memory usage, which become prohibitive as dataset size grows past twenty-five taxa. None of the probabilistic methods completed analyses of datasets with 30 taxa or more after many weeks of CPU runtime. 
CONCLUSIONS: We conclude that the state of the art of phylogenetic network inference lags well behind the scope of current phylogenomic studies. New algorithmic development is critically needed to address this methodological gap. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1277-1) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5064893 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5064893 2016-10-18 A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation Hejase, Hussein A. Liu, Kevin J. BMC Bioinformatics Research Article BioMed Central 2016-10-13 /pmc/articles/PMC5064893/ /pubmed/27737628 http://dx.doi.org/10.1186/s12859-016-1277-1 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Hejase, Hussein A. Liu, Kevin J. A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation |
title | A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation |
title_sort | scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5064893/ https://www.ncbi.nlm.nih.gov/pubmed/27737628 http://dx.doi.org/10.1186/s12859-016-1277-1 |