Cargando…

OCTAL: Optimal Completion of gene trees in polynomial time

BACKGROUND: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream a...

Descripción completa

Detalles Bibliográficos
Autores principales: Christensen, Sarah, Molloy, Erin K., Vachaspati, Pranjal, Warnow, Tandy
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2018
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5853121/
https://www.ncbi.nlm.nih.gov/pubmed/29568323
http://dx.doi.org/10.1186/s13015-018-0124-5
_version_ 1783306707391217664
author Christensen, Sarah
Molloy, Erin K.
Vachaspati, Pranjal
Warnow, Tandy
author_facet Christensen, Sarah
Molloy, Erin K.
Vachaspati, Pranjal
Warnow, Tandy
author_sort Christensen, Sarah
collection PubMed
description BACKGROUND: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. RESULTS: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson–Foulds (RF) distance, and we prove that OCTAL runs in [Formula: see text] time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. CONCLUSIONS: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL’s accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13015-018-0124-5) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5853121
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-58531212018-03-22 OCTAL: Optimal Completion of gene trees in polynomial time Christensen, Sarah Molloy, Erin K. Vachaspati, Pranjal Warnow, Tandy Algorithms Mol Biol Research BACKGROUND: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable. RESULTS: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson–Foulds (RF) distance, and we prove that OCTAL runs in [Formula: see text] time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree. CONCLUSIONS: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL’s accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13015-018-0124-5) contains supplementary material, which is available to authorized users. BioMed Central 2018-03-15 /pmc/articles/PMC5853121/ /pubmed/29568323 http://dx.doi.org/10.1186/s13015-018-0124-5 Text en © The Author(s) 2018 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Christensen, Sarah
Molloy, Erin K.
Vachaspati, Pranjal
Warnow, Tandy
OCTAL: Optimal Completion of gene trees in polynomial time
title OCTAL: Optimal Completion of gene trees in polynomial time
title_full OCTAL: Optimal Completion of gene trees in polynomial time
title_fullStr OCTAL: Optimal Completion of gene trees in polynomial time
title_full_unstemmed OCTAL: Optimal Completion of gene trees in polynomial time
title_short OCTAL: Optimal Completion of gene trees in polynomial time
title_sort octal: optimal completion of gene trees in polynomial time
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5853121/
https://www.ncbi.nlm.nih.gov/pubmed/29568323
http://dx.doi.org/10.1186/s13015-018-0124-5
work_keys_str_mv AT christensensarah octaloptimalcompletionofgenetreesinpolynomialtime
AT molloyerink octaloptimalcompletionofgenetreesinpolynomialtime
AT vachaspatipranjal octaloptimalcompletionofgenetreesinpolynomialtime
AT warnowtandy octaloptimalcompletionofgenetreesinpolynomialtime