A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework
BACKGROUND: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads longer than 150 bp, with up to 600 Gbp of data per run. These sequencers generate longer reads with greater sequencing depth than those generated by prev...
Main authors: | Chang, Yu-Jung; Chen, Chien-Chih; Chen, Chuen-Liang; Ho, Jan-Ming |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2012 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521391/ https://www.ncbi.nlm.nih.gov/pubmed/23282094 http://dx.doi.org/10.1186/1471-2164-13-S7-S28 |
_version_ | 1782252946542559232 |
---|---|
author | Chang, Yu-Jung; Chen, Chien-Chih; Chen, Chuen-Liang; Ho, Jan-Ming |
author_sort | Chang, Yu-Jung |
collection | PubMed |
description | BACKGROUND: State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads longer than 150 bp, with up to 600 Gbp of data per run. These sequencers generate longer reads with greater sequencing depth than previous technologies. Two major challenges exist in using high-throughput technology for de novo assembly of genomes. First, the amount of physical memory may be insufficient to store the data structures of the assembly algorithm, even for high-end multicore processors. Second, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms. RESULTS: We developed CloudBrush, a distributed genome assembler based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusts the edges of the string graph if necessary. CloudBrush was evaluated against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and a low misassembly rate for misjoins and indels of > 5 bp. In addition, we introduced two measures, precision and recall, to assess how faithfully contigs align to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush. |
format | Online Article Text |
id | pubmed-3521391 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3521391 2012-12-14 A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework Chang, Yu-Jung; Chen, Chien-Chih; Chen, Chuen-Liang; Ho, Jan-Ming BMC Genomics Proceedings BioMed Central 2012-12-07 /pmc/articles/PMC3521391/ /pubmed/23282094 http://dx.doi.org/10.1186/1471-2164-13-S7-S28 Text en Copyright ©2012 Chang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework |
title_sort | de novo next generation genomic sequence assembler based on string graph and mapreduce cloud computing framework |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521391/ https://www.ncbi.nlm.nih.gov/pubmed/23282094 http://dx.doi.org/10.1186/1471-2164-13-S7-S28 |