Cargando…

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computa...

Descripción completa

Detalles Bibliográficos
Autores principales: Cai, Yunpeng, Zheng, Wei, Yao, Jin, Yang, Yujie, Mai, Volker, Mao, Qi, Sun, Yijun
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5421816/
https://www.ncbi.nlm.nih.gov/pubmed/28437450
http://dx.doi.org/10.1371/journal.pcbi.1005518
_version_ 1783234655343869952
author Cai, Yunpeng
Zheng, Wei
Yao, Jin
Yang, Yujie
Mai, Volker
Mao, Qi
Sun, Yijun
author_facet Cai, Yunpeng
Zheng, Wei
Yao, Jin
Yang, Yujie
Mai, Volker
Mao, Qi
Sun, Yijun
author_sort Cai, Yunpeng
collection PubMed
description The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now compu- tationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html.
format Online
Article
Text
id pubmed-5421816
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-54218162017-05-12 ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time Cai, Yunpeng Zheng, Wei Yao, Jin Yang, Yujie Mai, Volker Mao, Qi Sun, Yijun PLoS Comput Biol Research Article The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now compu- tationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html. Public Library of Science 2017-04-24 /pmc/articles/PMC5421816/ /pubmed/28437450 http://dx.doi.org/10.1371/journal.pcbi.1005518 Text en © 2017 Cai et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Cai, Yunpeng
Zheng, Wei
Yao, Jin
Yang, Yujie
Mai, Volker
Mao, Qi
Sun, Yijun
ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title_full ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title_fullStr ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title_full_unstemmed ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title_short ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time
title_sort esprit-forest: parallel clustering of massive amplicon sequence data in subquadratic time
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5421816/
https://www.ncbi.nlm.nih.gov/pubmed/28437450
http://dx.doi.org/10.1371/journal.pcbi.1005518
work_keys_str_mv AT caiyunpeng espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT zhengwei espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT yaojin espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT yangyujie espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT maivolker espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT maoqi espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime
AT sunyijun espritforestparallelclusteringofmassiveampliconsequencedatainsubquadratictime