Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads

Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundament...

Bibliographic Details
Main authors:	Ye, Chengxi, Ma, Zhanshan (Sam)
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2016
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4906657/ https://www.ncbi.nlm.nih.gov/pubmed/27330851 http://dx.doi.org/10.7717/peerj.2016

author	Ye, Chengxi Ma, Zhanshan (Sam)
collection	PubMed
description	Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe Sparc, a versatile and efficient linear-complexity consensus algorithm, to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path through a sparsity-induced reweighted graph, which approximates the most likely genome sequence, is taken as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with an error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve a similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time. Availability. The source code is available for download at https://github.com/yechengxi/Sparc.
format	Online Article Text
id	pubmed-4906657
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-4906657 2016-06-17 PeerJ Inc. 2016-06-08 /pmc/articles/PMC4906657/ /pubmed/27330851 http://dx.doi.org/10.7717/peerj.2016 Text en ©2016 Ye and Ma http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4906657/ https://www.ncbi.nlm.nih.gov/pubmed/27330851 http://dx.doi.org/10.7717/peerj.2016
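
The abstract in this record describes building a sparse k-mer graph and taking the heaviest path through a reweighted graph as the consensus. The following is only a toy sketch of that general idea, not the Sparc implementation (which is available at the GitHub URL given in the abstract). It assumes reads are already placed on a common coordinate system without insertions or deletions, and the names `build_sparse_kmer_graph`, `heaviest_path`, `spell`, the sampling gap `g`, and the subtracted `threshold` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_sparse_kmer_graph(reads, k=3, g=2):
    """Count edge support in a toy sparse k-mer graph.

    reads: list of (offset, sequence) pairs, ASSUMED to be pre-aligned to
    one coordinate system with substitutions only (a strong simplification).
    Nodes are (position, k-mer) pairs sampled every g positions; an edge
    joins consecutive sampled nodes of the same read.
    """
    edges = defaultdict(int)
    for offset, seq in reads:
        start = -(-offset // g) * g  # first multiple of g at or after offset
        prev = None
        for pos in range(start, offset + len(seq) - k + 1, g):
            node = (pos, seq[pos - offset:pos - offset + k])
            if prev is not None:
                edges[(prev, node)] += 1
            prev = node
    return edges


def heaviest_path(edges, threshold=1):
    """Return the highest-scoring path after subtracting a fixed threshold
    from every edge weight (an assumed, simplified form of reweighting)."""
    adjusted = {e: w - threshold for e, w in edges.items()}
    # Edges always point to a larger position, so sorting nodes by
    # position gives a valid topological order.
    nodes = sorted({n for e in adjusted for n in e})
    if not nodes:
        return []
    best = {n: 0 for n in nodes}      # best score of a path ending at n
    back = {n: None for n in nodes}   # predecessor on that best path
    incoming = defaultdict(list)
    for (u, v), w in adjusted.items():
        incoming[v].append((u, w))
    for v in nodes:
        for u, w in incoming[v]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                back[v] = u
    path = [max(nodes, key=lambda n: best[n])]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return path[::-1]


def spell(path, g=2):
    """Concatenate the path into a sequence; requires g <= k."""
    if not path:
        return ""
    seq = path[0][1]
    for _, kmer in path[1:]:
        seq += kmer[-g:]  # each step contributes g new bases
    return seq


if __name__ == "__main__":
    reads = [(0, "ACGTTGCAT")] * 3 + [(0, "ACGATGCAT")]  # last read has one error
    print(spell(heaviest_path(build_sparse_kmer_graph(reads))))
```

In this sketch the erroneous read creates a side branch with support 1, which the threshold reduces to zero, so the dynamic program follows the well-supported branch. Real long reads contain many insertions and deletions, which this toy model does not handle.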
