
SOPRA: Scaffolding algorithm for paired reads via statistical optimization

BACKGROUND: High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be...


Bibliographic Details
Main Authors: Dayarian, Adel, Michael, Todd P, Sengupta, Anirvan M
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909219/
https://www.ncbi.nlm.nih.gov/pubmed/20576136
http://dx.doi.org/10.1186/1471-2105-11-345
author Dayarian, Adel
Michael, Todd P
Sengupta, Anirvan M
collection PubMed
description BACKGROUND: High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome. RESULTS: We have developed SOPRA, a tool designed to exploit the mate pair/paired-end information for assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with vertices and with edges of the contig connectivity graph. Vertices of this graph are individual contigs with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor quality results in the current context. SOPRA circumvents this problem by treating all the constraints on equal footing for solving the optimization problem, the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing of constraints is iterated till one reaches a core set of consistent constraints. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors. CONCLUSIONS: Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.
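
The scaffolding idea in the description above (contigs as vertices, mate-pair links as edges, and repeated removal of the constraints that an optimal solution cannot satisfy) can be illustrated with a small Python sketch. This is not SOPRA's implementation: the function names, the (i, j, sign, weight) link format, the support weights, and the brute-force search are all assumptions chosen for a toy example, and only the relative orientation of contigs is modelled, not their positions.

```python
from itertools import product


def best_orientation(n_contigs, links):
    """Exhaustively pick orientations s_i in {+1, -1} with maximal satisfied weight.

    links: list of (i, j, sign, weight) tuples (hypothetical format), where
    sign is +1 if mate pairs suggest contigs i and j lie on the same strand,
    -1 if on opposite strands, and weight is the number of supporting pairs.
    Brute force is only feasible for a handful of contigs.
    """
    best, best_score = None, -1
    for rest in product((1, -1), repeat=n_contigs - 1):
        s = (1,) + rest  # fix contig 0 to remove the global flip symmetry
        score = sum(w for i, j, sign, w in links if s[i] * s[j] == sign)
        if score > best_score:
            best, best_score = s, score
    return best


def consistent_core(n_contigs, links):
    """Solve, drop the violated links, and repeat until all remaining links agree."""
    kept, dropped = list(links), []
    while True:
        s = best_orientation(n_contigs, kept)
        violated = [l for l in kept if s[l[0]] * s[l[1]] != l[2]]
        if not violated:
            return s, kept, dropped
        dropped.extend(violated)
        kept = [l for l in kept if l not in violated]


# Toy contig connectivity graph: four contigs, five mate-pair links.
# The weakly supported link between contigs 0 and 3 conflicts with the rest.
example_links = [
    (0, 1, +1, 4),
    (1, 2, -1, 3),
    (0, 2, -1, 2),
    (2, 3, +1, 5),
    (0, 3, +1, 1),
]
orientation, kept_links, dropped_links = consistent_core(4, example_links)
print("orientation:", orientation)
print("dropped links:", dropped_links)
```

All links enter the optimization together rather than being added greedily one at a time, and the solution itself identifies which link to discard. A realistic scaffolder would replace the exhaustive search with a heuristic optimizer, since the number of orientation assignments doubles with every contig.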
format Text
id pubmed-2909219
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2909219 2010-07-24 SOPRA: Scaffolding algorithm for paired reads via statistical optimization Dayarian, Adel Michael, Todd P Sengupta, Anirvan M BMC Bioinformatics Research Article BioMed Central 2010-06-24 /pmc/articles/PMC2909219/ /pubmed/20576136 http://dx.doi.org/10.1186/1471-2105-11-345 Text en Copyright ©2010 Dayarian et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title SOPRA: Scaffolding algorithm for paired reads via statistical optimization
topic Research Article