Cargando…

A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem

A large variety of pairwise measures of similarity or dissimilarity have been developed for comparing phylogenetic trees, for example, species trees or gene trees. Due to its intuitive definition in terms of tree clades and bipartitions and its computational efficiency, the Robinson–Foulds (RF) dist...

Descripción completa

Detalles Bibliográficos
Autores principales: Briand, Samuel, Dessimoz, Christophe, El-Mabrouk, Nadia, Nevers, Yannis
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9557742/
https://www.ncbi.nlm.nih.gov/pubmed/35426933
http://dx.doi.org/10.1093/sysbio/syac028
_version_ 1784807292405809152
author Briand, Samuel
Dessimoz, Christophe
El-Mabrouk, Nadia
Nevers, Yannis
author_facet Briand, Samuel
Dessimoz, Christophe
El-Mabrouk, Nadia
Nevers, Yannis
author_sort Briand, Samuel
collection PubMed
description A large variety of pairwise measures of similarity or dissimilarity have been developed for comparing phylogenetic trees, for example, species trees or gene trees. Due to its intuitive definition in terms of tree clades and bipartitions and its computational efficiency, the Robinson–Foulds (RF) distance is the most widely used for trees with unweighted edges and labels restricted to leaves (representing the genetic elements being compared). However, in the case of gene trees, an important information revealing the nature of the homologous relation between gene pairs (orthologs, paralogs, and xenologs) is the type of event associated to each internal node of the tree, typically speciations or duplications, but other types of events may also be considered, such as horizontal gene transfers. This labeling of internal nodes is usually inferred from a gene tree/species tree reconciliation method. Here, we address the problem of comparing such event-labeled trees. The problem differs from the classical problem of comparing uniformly labeled trees (all labels belonging to the same alphabet) that may be done using the Tree Edit Distance (TED) mainly due to the fact that, in our case, two different alphabets are considered for the leaves and internal nodes of the tree, and leaves are not affected by edit operations. We propose an extension of the RF distance to event-labeled trees, based on edit operations comparable to those considered for TED: node insertion, node deletion, and label substitution. We show that this new Labeled Robinson–Foulds (LRF) distance can be computed in linear time, in addition of maintaining other desirable properties: being a metric, reducing to RF for trees with no labels on internal nodes and maintaining an intuitive interpretation. The algorithm for computing the LRF distance enables novel analyses on event-label trees such as reconciled gene trees. Here, we use it to study the impact of taxon sampling on labeled gene tree inference and conclude that denser taxon sampling yields trees with better topology but worse labeling. [Algorithms; combinatorics; gene trees; phylogenetics; Robinson–Foulds; tree distance.]
format Online
Article
Text
id pubmed-9557742
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-95577422022-10-14 A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem Briand, Samuel Dessimoz, Christophe El-Mabrouk, Nadia Nevers, Yannis Syst Biol Regular Articles A large variety of pairwise measures of similarity or dissimilarity have been developed for comparing phylogenetic trees, for example, species trees or gene trees. Due to its intuitive definition in terms of tree clades and bipartitions and its computational efficiency, the Robinson–Foulds (RF) distance is the most widely used for trees with unweighted edges and labels restricted to leaves (representing the genetic elements being compared). However, in the case of gene trees, an important information revealing the nature of the homologous relation between gene pairs (orthologs, paralogs, and xenologs) is the type of event associated to each internal node of the tree, typically speciations or duplications, but other types of events may also be considered, such as horizontal gene transfers. This labeling of internal nodes is usually inferred from a gene tree/species tree reconciliation method. Here, we address the problem of comparing such event-labeled trees. The problem differs from the classical problem of comparing uniformly labeled trees (all labels belonging to the same alphabet) that may be done using the Tree Edit Distance (TED) mainly due to the fact that, in our case, two different alphabets are considered for the leaves and internal nodes of the tree, and leaves are not affected by edit operations. We propose an extension of the RF distance to event-labeled trees, based on edit operations comparable to those considered for TED: node insertion, node deletion, and label substitution. We show that this new Labeled Robinson–Foulds (LRF) distance can be computed in linear time, in addition of maintaining other desirable properties: being a metric, reducing to RF for trees with no labels on internal nodes and maintaining an intuitive interpretation. The algorithm for computing the LRF distance enables novel analyses on event-label trees such as reconciled gene trees. Here, we use it to study the impact of taxon sampling on labeled gene tree inference and conclude that denser taxon sampling yields trees with better topology but worse labeling. [Algorithms; combinatorics; gene trees; phylogenetics; Robinson–Foulds; tree distance.] Oxford University Press 2022-04-15 /pmc/articles/PMC9557742/ /pubmed/35426933 http://dx.doi.org/10.1093/sysbio/syac028 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of the Society of Systematic Biologists. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Regular Articles
Briand, Samuel
Dessimoz, Christophe
El-Mabrouk, Nadia
Nevers, Yannis
A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title_full A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title_fullStr A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title_full_unstemmed A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title_short A Linear Time Solution to the Labeled Robinson–Foulds Distance Problem
title_sort linear time solution to the labeled robinson–foulds distance problem
topic Regular Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9557742/
https://www.ncbi.nlm.nih.gov/pubmed/35426933
http://dx.doi.org/10.1093/sysbio/syac028
work_keys_str_mv AT briandsamuel alineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT dessimozchristophe alineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT elmabrouknadia alineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT neversyannis alineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT briandsamuel lineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT dessimozchristophe lineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT elmabrouknadia lineartimesolutiontothelabeledrobinsonfouldsdistanceproblem
AT neversyannis lineartimesolutiontothelabeledrobinsonfouldsdistanceproblem