
Acceleration of sequence clustering using longest common subsequence filtering

BACKGROUND: Huge numbers of genomes can now be sequenced rapidly with recent improvements in sequencing throughput. However, data analysis methods have not kept up, making it difficult to process the vast amounts of available sequence data. This increased processing time is especially critical in DN...


Bibliographic Details
Main Authors:	Namiki, Youhei, Ishida, Takashi, Akiyama, Yutaka
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3654901/ https://www.ncbi.nlm.nih.gov/pubmed/23815271 http://dx.doi.org/10.1186/1471-2105-14-S8-S7

author	Namiki, Youhei Ishida, Takashi Akiyama, Yutaka
collection	PubMed
description	BACKGROUND: Huge numbers of genomes can now be sequenced rapidly with recent improvements in sequencing throughput. However, data analysis methods have not kept up, making it difficult to process the vast amounts of available sequence data. This increased processing time is especially critical in DNA sequence clustering because of the intrinsic difficulty in parallelization. Thus, there is a strong demand for a faster clustering algorithm. RESULTS: We developed a new fast DNA sequence clustering method called LCS-HIT, based on the popular CD-HIT program. The proposed method uses a novel filtering technique based on the longest common subsequence to identify similar sequence pairs. This filtering technique makes LCS-HIT considerably faster than CD-HIT, without loss of sensitivity. For a dataset of two million DNA sequences, our method was approximately 7.1, 4.4, and 2.2 times faster than CD-HIT for 100, 150, and 400 bases, respectively. CONCLUSIONS: The LCS-HIT clustering program, using a novel filtering technique based on the longest common subsequence, is significantly faster than CD-HIT without compromising clustering accuracy. Moreover, the filtering technique itself is independent of the CD-HIT algorithm. Thus, this technique can be applied to similar clustering algorithms.
format	Online Article Text
id	pubmed-3654901
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3654901 2013-05-20 Acceleration of sequence clustering using longest common subsequence filtering Namiki, Youhei Ishida, Takashi Akiyama, Yutaka BMC Bioinformatics Proceedings BioMed Central 2013-05-09 /pmc/articles/PMC3654901/ /pubmed/23815271 http://dx.doi.org/10.1186/1471-2105-14-S8-S7 Text en Copyright © 2013 Namiki et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Acceleration of sequence clustering using longest common subsequence filtering
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3654901/ https://www.ncbi.nlm.nih.gov/pubmed/23815271 http://dx.doi.org/10.1186/1471-2105-14-S8-S7
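
The abstract describes a filter based on the longest common subsequence (LCS) that screens sequence pairs before the more expensive comparison. The record does not give the filter's details, so the following is only a minimal sketch of the general idea, not the LCS-HIT implementation. It relies on one observation: the matched positions of any alignment form a common subsequence, so the LCS length is an upper bound on the number of identical positions. The function names, the `identity` parameter, and the choice to measure identity against the shorter sequence are assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, O(len(a) * len(b)) time,
    keeping only one row of the table in memory.
    """
    if len(b) > len(a):
        a, b = b, a  # keep the stored row as short as possible
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def passes_lcs_filter(a: str, b: str, identity: float = 0.9) -> bool:
    """Hypothetical filter: keep a pair only if its LCS is long enough.

    Assumption: identity is measured against the shorter sequence.
    A pair whose LCS is shorter than identity * shorter length cannot
    reach that identity, so it can be discarded without alignment.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return False
    return lcs_length(a, b) >= identity * shorter
```

A pair that passes this check would still need a full comparison; the filter only rules out pairs that cannot meet the threshold, which is consistent with the abstract's claim of no loss of sensitivity.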


Similar Items