A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation

BACKGROUND: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and n...


Bibliographic Details
Main authors: Hejase, Hussein A., Liu, Kevin J.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5064893/
https://www.ncbi.nlm.nih.gov/pubmed/27737628
http://dx.doi.org/10.1186/s12859-016-1277-1
author Hejase, Hussein A.
Liu, Kevin J.
collection PubMed
description BACKGROUND: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown.
RESULTS: In this study, we quantify the performance of state-of-the-art phylogenetic network inference methods on large-scale datasets using empirical data sampled from natural mouse populations and a range of simulations using model phylogenies with a single reticulation. We find that, as in the case of phylogenetic tree inference, the performance of leading network inference methods is negatively impacted by both dimensions of dataset scale. In general, we found that topological accuracy degrades as the number of taxa increases; a similar effect was observed with increased sequence mutation rate. The most accurate methods were probabilistic inference methods which maximize either likelihood under coalescent-based models or pseudo-likelihood approximations to the model likelihood. The improved accuracy obtained with probabilistic inference methods comes at a computational cost in terms of runtime and main memory usage, which become prohibitive as dataset size grows past twenty-five taxa. None of the probabilistic methods completed analyses of datasets with 30 taxa or more after many weeks of CPU runtime.
CONCLUSIONS: We conclude that the state of the art of phylogenetic network inference lags well behind the scope of current phylogenomic studies. New algorithmic development is critically needed to address this methodological gap.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1277-1) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5064893
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5064893 2016-10-18 A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation. Hejase, Hussein A.; Liu, Kevin J. BMC Bioinformatics. Research Article. BioMed Central 2016-10-13 /pmc/articles/PMC5064893/ /pubmed/27737628 http://dx.doi.org/10.1186/s12859-016-1277-1 Text en
© The Author(s) 2016. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5064893/
https://www.ncbi.nlm.nih.gov/pubmed/27737628
http://dx.doi.org/10.1186/s12859-016-1277-1