Starcode: sequence clustering based on all-pairs search

Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA...

Bibliographic Details
Main authors:	Zorita, Eduard; Cuscó, Pol; Filion, Guillaume J.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2015
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765884/ https://www.ncbi.nlm.nih.gov/pubmed/25638815 http://dx.doi.org/10.1093/bioinformatics/btv053

author	Zorita, Eduard; Cuscó, Pol; Filion, Guillaume J.
collection	PubMed
description	Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem. Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision. Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode. Contact: guillaume.filion@gmail.com
format	Online Article Text
id	pubmed-4765884
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4765884 2016-03-04 Starcode: sequence clustering based on all-pairs search Zorita, Eduard; Cuscó, Pol; Filion, Guillaume J. Bioinformatics Original Papers Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem. Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision. Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode. Contact: guillaume.filion@gmail.com Oxford University Press 2015-06-15 2015-01-31 /pmc/articles/PMC4765884/ /pubmed/25638815 http://dx.doi.org/10.1093/bioinformatics/btv053 Text en © The Author 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Starcode: sequence clustering based on all-pairs search
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765884/ https://www.ncbi.nlm.nih.gov/pubmed/25638815 http://dx.doi.org/10.1093/bioinformatics/btv053
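
The description field outlines a two-step procedure: find every pair of sequences within a given Levenshtein distance, then merge matched pairs into clusters. A minimal Python sketch of that idea follows. It is not starcode's implementation: it compares all pairs by brute force instead of using the poucet search on a trie, and it assumes that merging means taking connected components of the match graph, which the record does not specify. The names `levenshtein` and `cluster` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cluster(seqs, max_dist):
    """Group sequences linked by chains of pairs within max_dist.

    Assumption: clusters are connected components (union-find);
    the source only says matched pairs are merged into clusters.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: quadratic in the number of sequences.
    for i, j in combinations(range(len(seqs)), 2):
        if levenshtein(seqs[i], seqs[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return sorted(sorted(g) for g in groups.values())


barcodes = ["ACGT", "ACGA", "TTTT", "TTTA", "GGGG"]
print(cluster(barcodes, max_dist=1))
```

The quadratic pair loop is exactly the cost the abstract calls computationally complex; according to the abstract, starcode obtains its efficiency by running the Needleman–Wunsch computation on the nodes of a trie instead.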
