
Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices

Sequence clustering is a fundamental tool of molecular biology that is being challenged by increasing dataset sizes from high-throughput sequencing. The agglomerative algorithms that have been relied upon for their accuracy require the construction of computationally costly distance matrices which c...


Bibliographic Details
Main authors:	Kellom, Matthew; Raymond, Jason
Format:	Online Article Text
Language:	English
Published:	Journal of Biological Methods 2017
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6708925/ https://www.ncbi.nlm.nih.gov/pubmed/31453226 http://dx.doi.org/10.14440/jbm.2017.153

_version_	1783446089550004224
author	Kellom, Matthew; Raymond, Jason
author_facet	Kellom, Matthew; Raymond, Jason
author_sort	Kellom, Matthew
collection	PubMed
description	Sequence clustering is a fundamental tool of molecular biology that is being challenged by increasing dataset sizes from high-throughput sequencing. The agglomerative algorithms that have been relied upon for their accuracy require the construction of computationally costly distance matrices which can overwhelm basic research personal computers. Alternative algorithms exist, such as centroid-linkage, to circumvent large memory requirements but their results are often input-order dependent. We present a method for bootstrapping the results of many centroid-linkage clustering iterations into an aggregate set of clusters, increasing cluster accuracy without a distance matrix. This method ranks cluster edges by conservation across iterations and reconstructs aggregate clusters from the resulting ranked edge list, pruning out low-frequency cluster edges that may have been a result of a specific sequence input order. Aggregating centroid-linkage clustering iterations can help researchers using basic research personal computers acquire more reliable clustering results without increasing memory resources.
format	Online Article Text
id	pubmed-6708925
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Journal of Biological Methods
record_format	MEDLINE/PubMed
spelling	pubmed-67089252019-08-26 Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices Kellom, Matthew Raymond, Jason J Biol Methods Article Sequence clustering is a fundamental tool of molecular biology that is being challenged by increasing dataset sizes from high-throughput sequencing. The agglomerative algorithms that have been relied upon for their accuracy require the construction of computationally costly distance matrices which can overwhelm basic research personal computers. Alternative algorithms exist, such as centroid-linkage, to circumvent large memory requirements but their results are often input-order dependent. We present a method for bootstrapping the results of many centroid-linkage clustering iterations into an aggregate set of clusters, increasing cluster accuracy without a distance matrix. This method ranks cluster edges by conservation across iterations and reconstructs aggregate clusters from the resulting ranked edge list, pruning out low-frequency cluster edges that may have been a result of a specific sequence input order. Aggregating centroid-linkage clustering iterations can help researchers using basic research personal computers acquire more reliable clustering results without increasing memory resources. Journal of Biological Methods 2017-03-16 /pmc/articles/PMC6708925/ /pubmed/31453226 http://dx.doi.org/10.14440/jbm.2017.153 Text en © 2013-2018 The Journal of Biological Methods, All rights reserved. https://creativecommons.org/licenses/by/3.0/ This work is licensed under a Creative Commons Attribution 3.0 License.
spellingShingle	Article Kellom, Matthew Raymond, Jason Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices
title	Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices
title_sort	using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6708925/ https://www.ncbi.nlm.nih.gov/pubmed/31453226 http://dx.doi.org/10.14440/jbm.2017.153
work_keys_str_mv	AT kellommatthew usingclusteredgecountingtoaggregateiterationsofcentroidlinkageclusteringresultsandavoidlargedistancematrices AT raymondjason usingclusteredgecountingtoaggregateiterationsofcentroidlinkageclusteringresultsandavoidlargedistancematrices
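
The description above says the method ranks cluster edges by how often they are conserved across clustering iterations, prunes low-frequency edges, and rebuilds aggregate clusters from the ranked list. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes that an "edge" is any pair of sequences placed in the same cluster in one iteration, that pruning uses a simple frequency cutoff, and that aggregate clusters are the connected components of the surviving edges. The names `count_edges`, `aggregate_clusters`, and `min_fraction` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_edges(iterations):
    """Count, for each pair of sequences, how many iterations co-cluster them.

    `iterations` is a list of clustering results; each result is a list of
    clusters, and each cluster is a list of sequence identifiers.
    Assumption: an edge is any unordered pair within one cluster.
    """
    counts = Counter()
    for clusters in iterations:
        for cluster in clusters:
            for a, b in combinations(sorted(set(cluster)), 2):
                counts[(a, b)] += 1
    return counts


def aggregate_clusters(iterations, min_fraction=0.5):
    """Rebuild clusters from edges seen in at least `min_fraction` of iterations."""
    counts = count_edges(iterations)
    n_iter = len(iterations)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Register every sequence so that fully pruned ones remain as singletons.
    for clusters in iterations:
        for cluster in clusters:
            for seq in cluster:
                find(seq)

    # Walk the edge list from most to least conserved; stop at the cutoff.
    for (a, b), n in counts.most_common():
        if n / n_iter < min_fraction:
            break
        parent[find(a)] = find(b)

    groups = {}
    for seq in list(parent):
        groups.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(group) for group in groups.values())


# Three hypothetical iterations over the same five sequences; the second
# iteration places s3 differently, as an input-order effect might.
iterations = [
    [["s1", "s2", "s3"], ["s4", "s5"]],
    [["s1", "s2"], ["s3", "s4", "s5"]],
    [["s1", "s2", "s3"], ["s4", "s5"]],
]
result = aggregate_clusters(iterations, min_fraction=0.5)
print(result)  # [['s1', 's2', 's3'], ['s4', 's5']]
```

In this toy input the edges joining s3 to s4 and s5 appear in only one of three iterations, so they fall below the cutoff and s3 stays with s1 and s2. Counting every within-cluster pair grows quadratically with cluster size, so a memory-conscious version might count only centroid-to-member edges instead; that choice is also an assumption rather than something the record specifies.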
