biobambam: tools for read pair collation based algorithms on BAM files

BACKGROUND: Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the ma...

Bibliographic Details
Main Authors:	Tischler, German, Leonard, Steven
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Software Review
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4075596/ http://dx.doi.org/10.1186/1751-0473-9-13

author	Tischler, German Leonard, Steven
author_facet	Tischler, German Leonard, Steven
author_sort	Tischler, German
collection	PubMed
description	BACKGROUND: Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order paired reads are usually separated in the file, which complicates some other applications like duplicate marking or conversion to the FastQ format which require to access the full information of the pairs. RESULTS: In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time and space consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm tasks like duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package. CONCLUSIONS: In comparison with previous approaches to problems involving the collation of alignments by read name like the BAM to FastQ or duplication marking utilities our approach can often perform an equivalent task more efficiently in terms of the required main memory and run-time. Our BAM to FastQ conversion is faster than all widely known alternatives including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor bamUtil for small data sets and faster than all known alternatives on large and complex data sets.
format	Online Article Text
id	pubmed-4075596
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4075596 2014-07-01 biobambam: tools for read pair collation based algorithms on BAM files Tischler, German Leonard, Steven Source Code Biol Med Software Review
BioMed Central 2014-06-20 /pmc/articles/PMC4075596/ http://dx.doi.org/10.1186/1751-0473-9-13 Text en Copyright © 2014 Tischler and Leonard; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle	Software Review Tischler, German Leonard, Steven biobambam: tools for read pair collation based algorithms on BAM files
title	biobambam: tools for read pair collation based algorithms on BAM files
title_full	biobambam: tools for read pair collation based algorithms on BAM files
title_fullStr	biobambam: tools for read pair collation based algorithms on BAM files
title_full_unstemmed	biobambam: tools for read pair collation based algorithms on BAM files
title_short	biobambam: tools for read pair collation based algorithms on BAM files
title_sort	biobambam: tools for read pair collation based algorithms on bam files
topic	Software Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4075596/ http://dx.doi.org/10.1186/1751-0473-9-13
work_keys_str_mv	AT tischlergerman biobambamtoolsforreadpaircollationbasedalgorithmsonbamfiles AT leonardsteven biobambamtoolsforreadpaircollationbasedalgorithmsonbamfiles
