TreeCluster: Clustering biological sequences using phylogenetic trees

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, mo...

Bibliographic Details
Main authors: Balaban, Metin, Moshiri, Niema, Mai, Uyen, Jia, Xingfan, Mirarab, Siavash
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6705769/
https://www.ncbi.nlm.nih.gov/pubmed/31437182
http://dx.doi.org/10.1371/journal.pone.0221068
_version_ 1783445621777104896
author Balaban, Metin
Moshiri, Niema
Mai, Uyen
Jia, Xingfan
Mirarab, Siavash
author_sort Balaban, Metin
collection PubMed
description Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
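The diameter constraint named in the description, item (1), can be illustrated with a small bottom-up greedy sketch. This is not taken from the TreeCluster source code and is not claimed to reproduce the authors' algorithm; the tree representation (a dict mapping each node to its `(child, branch_length)` pairs), the function name `max_diameter_clusters`, and the example tree are all assumptions made for illustration. The sketch only guarantees that no two leaves in the same cluster are farther apart than `threshold` along the tree; it sorts children and copies leaf lists, so it is not strictly linear-time.

```python
from typing import Dict, List, Tuple

# Hypothetical tree encoding: node name -> list of (child name, branch length).
# Nodes that do not appear as keys (or map to an empty list) are leaves.
Tree = Dict[str, List[Tuple[str, float]]]


def max_diameter_clusters(tree: Tree, root: str, threshold: float) -> List[List[str]]:
    """Greedy sketch: cut edges bottom-up so that, within every cluster,
    the tree distance between any two leaves is at most `threshold`."""
    clusters: List[List[str]] = []

    def visit(node: str) -> Tuple[float, List[str]]:
        children = tree.get(node, [])
        if not children:
            # A leaf is at distance 0 from itself.
            return 0.0, [node]

        # For each child: (distance from `node` to the deepest leaf still
        # connected through that child, the leaves still connected).
        parts = []
        for child, length in children:
            height, leaves = visit(child)
            parts.append((height + length, leaves))
        parts.sort(key=lambda part: part[0])

        # While the two deepest branches together exceed the threshold,
        # cut the deepest one; its leaves become a finished cluster.
        while len(parts) >= 2 and parts[-1][0] + parts[-2][0] > threshold:
            _, leaves = parts.pop()
            clusters.append(leaves)

        height = parts[-1][0]
        merged = [leaf for _, leaves in parts for leaf in leaves]
        return height, merged

    _, remaining = visit(root)
    clusters.append(remaining)  # whatever is still attached to the root
    return clusters


# Illustrative tree (made up): leaf C sits on a long branch.
example_tree: Tree = {
    "root": [("A", 0.1), ("X", 0.1)],
    "X": [("B", 0.1), ("C", 0.5)],
}

if __name__ == "__main__":
    for cluster in max_diameter_clusters(example_tree, "root", threshold=0.4):
        print(sorted(cluster))
```

Cutting the deepest branch first is one natural greedy choice: it removes the leaves that contribute most to the diameter while keeping as many of the remaining leaves together as possible. The sum-of-branch-lengths and chained-distance constraints from the description would need different bookkeeping and are not covered here.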
format Online
Article
Text
id pubmed-6705769
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6705769 2019-09-04 TreeCluster: Clustering biological sequences using phylogenetic trees Balaban, Metin Moshiri, Niema Mai, Uyen Jia, Xingfan Mirarab, Siavash PLoS One Research Article
Public Library of Science 2019-08-22 /pmc/articles/PMC6705769/ /pubmed/31437182 http://dx.doi.org/10.1371/journal.pone.0221068 Text en © 2019 Balaban et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Balaban, Metin
Moshiri, Niema
Mai, Uyen
Jia, Xingfan
Mirarab, Siavash
TreeCluster: Clustering biological sequences using phylogenetic trees
title TreeCluster: Clustering biological sequences using phylogenetic trees
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6705769/
https://www.ncbi.nlm.nih.gov/pubmed/31437182
http://dx.doi.org/10.1371/journal.pone.0221068