Cargando…

SlideSort: all pairs similarity search for short reads

Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other im...

Descripción completa

Detalles Bibliográficos
Autores principales: Shimizu, Kana, Tsuda, Koji
Formato: Texto
Lenguaje:English
Publicado: Oxford University Press 2011
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035798/
https://www.ncbi.nlm.nih.gov/pubmed/21148542
http://dx.doi.org/10.1093/bioinformatics/btq677
_version_ 1782197815074619392
author Shimizu, Kana
Tsuda, Koji
author_facet Shimizu, Kana
Tsuda, Koji
author_sort Shimizu, Kana
collection PubMed
description Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing. Availability: Executable binary files and C++ libraries are available at http://www.cbrc.jp/~shimizu/slidesort/ for Linux and Windows. Contact: slidesort@m.aist.go.jp; shimizu-kana@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
format Text
id pubmed-3035798
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-30357982011-02-09 SlideSort: all pairs similarity search for short reads Shimizu, Kana Tsuda, Koji Bioinformatics Original Papers Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing. Availability: Executable binary files and C++ libraries are available at http://www.cbrc.jp/~shimizu/slidesort/ for Linux and Windows. Contact: slidesort@m.aist.go.jp; shimizu-kana@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2011-02-15 2010-12-09 /pmc/articles/PMC3035798/ /pubmed/21148542 http://dx.doi.org/10.1093/bioinformatics/btq677 Text en © The Author(s) 2010. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.5 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Shimizu, Kana
Tsuda, Koji
SlideSort: all pairs similarity search for short reads
title SlideSort: all pairs similarity search for short reads
title_full SlideSort: all pairs similarity search for short reads
title_fullStr SlideSort: all pairs similarity search for short reads
title_full_unstemmed SlideSort: all pairs similarity search for short reads
title_short SlideSort: all pairs similarity search for short reads
title_sort slidesort: all pairs similarity search for short reads
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035798/
https://www.ncbi.nlm.nih.gov/pubmed/21148542
http://dx.doi.org/10.1093/bioinformatics/btq677
work_keys_str_mv AT shimizukana slidesortallpairssimilaritysearchforshortreads
AT tsudakoji slidesortallpairssimilaritysearchforshortreads