
SEED: efficient clustering of next-generation sequences

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thu...


Bibliographic Details
Main Authors:	Bao, Ergude, Jiang, Tao, Kaloshian, Isgouhi, Girke, Thomas
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2011
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167058/ https://www.ncbi.nlm.nih.gov/pubmed/21810899 http://dx.doi.org/10.1093/bioinformatics/btr447

_version_	1782211217685741568
author	Bao, Ergude Jiang, Tao Kaloshian, Isgouhi Girke, Thomas
author_facet	Bao, Ergude Jiang, Tao Kaloshian, Isgouhi Girke, Thomas
author_sort	Bao, Ergude
collection	PubMed
description	Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed. Contact: thomas.girke@ucr.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-3167058
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-31670582011-09-06 SEED: efficient clustering of next-generation sequences Bao, Ergude Jiang, Tao Kaloshian, Isgouhi Girke, Thomas Bioinformatics Original Papers Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED—an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. 
Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed. Contact: thomas.girke@ucr.edu Supplementary information: Supplementary data are available at Bioinformatics online Oxford University Press 2011-09-15 2011-08-02 /pmc/articles/PMC3167058/ /pubmed/21810899 http://dx.doi.org/10.1093/bioinformatics/btr447 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.5 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Bao, Ergude Jiang, Tao Kaloshian, Isgouhi Girke, Thomas SEED: efficient clustering of next-generation sequences
title	SEED: efficient clustering of next-generation sequences
title_sort	seed: efficient clustering of next-generation sequences
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167058/ https://www.ncbi.nlm.nih.gov/pubmed/21810899 http://dx.doi.org/10.1093/bioinformatics/btr447
work_keys_str_mv	AT baoergude seedefficientclusteringofnextgenerationsequences AT jiangtao seedefficientclusteringofnextgenerationsequences AT kaloshianisgouhi seedefficientclusteringofnextgenerationsequences AT girkethomas seedefficientclusteringofnextgenerationsequences
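
The description field above outlines a two-step idea: index the reads in hash tables, then pick a center sequence and collect every neighbor within the mismatch limit. The Python sketch below illustrates that idea only and is not the SEED implementation. It makes several simplifying assumptions that are not in the record: reads in a cluster have equal length, overhanging residues are ignored, the center is simply the first unassigned read rather than a virtual center, and the block spaced seeds are replaced by a plain pigeonhole filter (two reads with at most k mismatches must agree exactly on at least one of k + 1 contiguous blocks). The function names `hamming`, `block_keys` and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(read, n_blocks):
    """Yield (block_index, substring) for n_blocks contiguous blocks.

    The last block absorbs any remainder so the whole read is covered.
    """
    size = len(read) // n_blocks
    for i in range(n_blocks):
        start = i * size
        end = len(read) if i == n_blocks - 1 else start + size
        yield i, read[start:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy center-based clustering (simplified stand-in, not SEED itself).

    Returns a list of clusters; each cluster is a list of read indices
    whose first element is the center.
    """
    n_blocks = max_mismatches + 1

    # Hash table: (block index, block content) -> indices of reads with that block.
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for key in block_keys(read, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for cid, center in enumerate(reads):
        if assigned[cid]:
            continue
        assigned[cid] = True
        members = [cid]

        # Candidates share at least one exact block with the center.
        candidates = set()
        for key in block_keys(center, n_blocks):
            candidates.update(index[key])

        # Verify each candidate with a full mismatch count.
        for rid in sorted(candidates):
            if assigned[rid] or len(reads[rid]) != len(center):
                continue
            if hamming(reads[rid], center) <= max_mismatches:
                assigned[rid] = True
                members.append(rid)
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTTTTTTTTT"]
    print(cluster_reads(example))
```

The hash lookup keeps the expensive full comparison limited to reads that already share a block with the center, which is the general reason seed-based indexing scales better than all-against-all comparison.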
