
Parallel short sequence assembly of transcriptomes

BACKGROUND: The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often...


Bibliographic Details
Main Authors:	Jackson, Benjamin G; Schnable, Patrick S; Aluru, Srinivas
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648799/ https://www.ncbi.nlm.nih.gov/pubmed/19208113 http://dx.doi.org/10.1186/1471-2105-10-S1-S14

Collection:	PubMed
Description:
BACKGROUND: The de novo assembly of genomes and transcriptomes from short sequences is a challenging problem. Because of the high coverage needed to assemble short sequences as well as the overhead of modeling the assembly problem as a graph problem, the methods for short sequence assembly are often validated using data from BACs or small sized prokaryotic genomes.
RESULTS: We present a parallel method for transcriptome assembly from large short sequence data sets. Our solution uses a rigorous graph theoretic framework and tames the computational and space complexity using parallel computers. First, we construct a distributed bidirected graph that captures overlap information. Next, we compact all chains in this graph to determine long unique contigs using undirected parallel list ranking, a problem for which we present an algorithm. Finally, we process this compacted distributed graph to resolve unique regions that are separated by repeats, exploiting the naturally occurring coverage variations arising from differential expression.
CONCLUSION: We demonstrate the validity of our method using a synthetic high coverage data set generated from the predicted coding regions of Zea mays. We assemble 925 million sequences consisting of 40 billion nucleotides in a few minutes on a 1024 processor Blue Gene/L. Our method is the first fully distributed method for assembling a non-hierarchical short sequence data set and can scale to large problem sizes.
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
Record:	pubmed-2648799 2009-03-03
Journal:	BMC Bioinformatics
Date:	2009-01-30
License:	Copyright © 2009 Jackson et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.