
Efficient construction of an assembly string graph using the FM-index

Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computat...


Bibliographic Details
Main Authors:	Simpson, Jared T.; Durbin, Richard
Format:	Text
Language:	English
Published:	Oxford University Press 2010
Subjects:	ISMB 2010 Conference Proceedings, July 11 to July 13, 2010, Boston, MA, USA
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881401/ https://www.ncbi.nlm.nih.gov/pubmed/20529929 http://dx.doi.org/10.1093/bioinformatics/btq217

_version_	1782182115911139328
author	Simpson, Jared T. Durbin, Richard
author_facet	Simpson, Jared T. Durbin, Richard
author_sort	Simpson, Jared T.
collection	PubMed
description	Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms. Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly. Contact: js18@sanger.ac.uk
format	Text
id	pubmed-2881401
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2881401 2010-06-08 Efficient construction of an assembly string graph using the FM-index Simpson, Jared T. Durbin, Richard Bioinformatics ISMB 2010 Conference Proceedings, July 11 to July 13, 2010, Boston, MA, USA Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms. Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly. Contact: js18@sanger.ac.uk Oxford University Press 2010-06-15 2010-06-01 /pmc/articles/PMC2881401/ /pubmed/20529929 http://dx.doi.org/10.1093/bioinformatics/btq217 Text en © The Author(s) 2010. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	ISMB 2010 Conference Proceedings, July 11 to July 13, 2010, Boston, MA, USA Simpson, Jared T. Durbin, Richard Efficient construction of an assembly string graph using the FM-index
title	Efficient construction of an assembly string graph using the FM-index
title_sort	efficient construction of an assembly string graph using the fm-index
topic	ISMB 2010 Conference Proceedings, July 11 to July 13, 2010, Boston, MA, USA
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881401/ https://www.ncbi.nlm.nih.gov/pubmed/20529929 http://dx.doi.org/10.1093/bioinformatics/btq217
work_keys_str_mv	AT simpsonjaredt efficientconstructionofanassemblystringgraphusingthefmindex AT durbinrichard efficientconstructionofanassemblystringgraphusingthefmindex
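
The abstract above describes using the FM-index to find overlaps of length at least τ among a set of reads. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes reads over the letters A, C, G, T; it builds the suffix array by naive sorting (so it does not achieve the O(N) bound the abstract reports); it reports all overlaps rather than only the irreducible ones; and the names build_index, step, and find_overlaps are hypothetical helpers introduced for illustration.

```python
def build_index(reads):
    """Build a toy FM-index over all reads (naive suffix sorting, for illustration only)."""
    # Assumption: '$' is written before every read so a read start can be searched for;
    # a single final '#' (which sorts before '$') terminates the text.
    text = "".join("$" + r for r in reads) + "#"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # suffix array
    bwt = [text[i - 1] for i in sa]                        # Burrows-Wheeler transform
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in the text that sort before ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    # Map the text position of each read's leading '$' to the read's id.
    starts, pos = {}, 0
    for read_id, r in enumerate(reads):
        starts[pos] = read_id
        pos += len(r) + 1
    return sa, c_table, occ, starts


def step(index, ch, lo, hi):
    """One backward-search step: narrow the interval for P to the interval for ch + P."""
    _, c_table, occ, _ = index
    if ch not in c_table:
        return 0, 0
    return c_table[ch] + occ[ch][lo], c_table[ch] + occ[ch][hi]


def find_overlaps(index, reads, x_id, tau):
    """Return (y_id, length) pairs where a suffix of read x_id equals a prefix of read y_id."""
    sa, _, _, starts = index
    x = reads[x_id]
    lo, hi = 0, len(sa)
    found = []
    # Extend the matched suffix of x one character at a time, right to left.
    for length in range(1, len(x)):
        lo, hi = step(index, x[-length], lo, hi)
        if lo >= hi:
            break  # this suffix of x occurs nowhere, so no longer suffix can either
        if length >= tau:
            # Prepending '$' keeps only occurrences located at the start of a read.
            d_lo, d_hi = step(index, "$", lo, hi)
            for row in range(d_lo, d_hi):
                y_id = starts[sa[row]]
                if y_id != x_id:
                    found.append((y_id, length))
    return found
```

For the reads ACGTAC, GTACGG, and TACGGA with τ = 3, calling find_overlaps on read 0 reports an overlap of 3 bases with read 2 and of 4 bases with read 1. The overlap from read 0 to read 2 is implied by the chain through read 1, which is the kind of transitive edge the abstract says the string graph construction leaves out.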
