
Starcode: sequence clustering based on all-pairs search

Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA...


Bibliographic Details
Main Authors: Zorita, Eduard; Cuscó, Pol; Filion, Guillaume J.
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects: Original Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765884/
https://www.ncbi.nlm.nih.gov/pubmed/25638815
http://dx.doi.org/10.1093/bioinformatics/btv053
author Zorita, Eduard
Cuscó, Pol
Filion, Guillaume J.
collection PubMed
description Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem. Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision. Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode. Contact: guillaume.filion@gmail.com
format Online
Article
Text
id pubmed-4765884
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4765884 2016-03-04 Starcode: sequence clustering based on all-pairs search Zorita, Eduard Cuscó, Pol Filion, Guillaume J. Bioinformatics Original Papers Oxford University Press 2015-06-15 2015-01-31 /pmc/articles/PMC4765884/ /pubmed/25638815 http://dx.doi.org/10.1093/bioinformatics/btv053 Text en © The Author 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Starcode: sequence clustering based on all-pairs search
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765884/
https://www.ncbi.nlm.nih.gov/pubmed/25638815
http://dx.doi.org/10.1093/bioinformatics/btv053
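
The abstract describes finding all pairs of sequences within a given Levenshtein distance and merging matched pairs into clusters. The sketch below illustrates that idea only in its naive form: it is not starcode's poucet search and uses no trie. It compares every pair with a textbook dynamic-programming edit distance, then groups matches as connected components. All function names (`levenshtein`, `matched_pairs`, `cluster`) are hypothetical, and the connected-components merge is an assumption made for illustration rather than the clustering rule used by the published tool.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def matched_pairs(seqs, max_dist):
    """Naive all-pairs search: quadratic in the number of sequences."""
    return [(s, t) for s, t in combinations(seqs, 2)
            if levenshtein(s, t) <= max_dist]


def cluster(seqs, max_dist):
    """Merge matched pairs into connected components (illustrative rule).

    Assumes the input sequences are distinct.
    """
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the root, halving the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for s, t in matched_pairs(seqs, max_dist):
        parent[find(s)] = find(t)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    barcodes = ["ACGTACGT", "ACGTACGA", "ACGTACG", "TTTTGGGG", "TTTTGGGC"]
    for group in cluster(barcodes, max_dist=1):
        print(group)
```

Because every pair is compared, this approach scales quadratically with the number of barcodes, which is the cost the abstract says starcode's trie-based search is designed to avoid.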