Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools

This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. The sam2bam framework is especially efficient on single-node, multi-core, large-memory systems. It can reduce the runtime of data pre-pr...

Bibliographic Details
Main authors: Ogasawara, Takeshi; Cheng, Yinhe; Tzeng, Tzy-Hwa Kathy
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5115855/
https://www.ncbi.nlm.nih.gov/pubmed/27861637
http://dx.doi.org/10.1371/journal.pone.0167100
author Ogasawara, Takeshi
Cheng, Yinhe
Tzeng, Tzy-Hwa Kathy
collection PubMed
description This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing (NGS) data. The sam2bam framework is especially efficient on single-node, multi-core, large-memory systems. It can reduce the runtime of data pre-processing for duplicate-read marking on a single-node system by 156–186x compared with de facto standard tools. The framework consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and, if available, hardware compression accelerators. As a basic feature, sam2bam converts between two well-known genome file formats, from SAM to BAM. Additional features such as analyzing, filtering, and converting input data are provided by plug-in tools (e.g., duplicate marking) that can be attached to sam2bam at runtime. We demonstrated that sam2bam could reduce the runtime of NGS data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. On the same system, sam2bam could reduce the runtime from about 20 hours to about nine minutes for a whole-genome sequencing data set using up to 711 GB of memory.
format Online
Article
Text
id pubmed-5115855
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5115855 2016-12-08 Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools Ogasawara, Takeshi Cheng, Yinhe Tzeng, Tzy-Hwa Kathy PLoS One Research Article
Public Library of Science 2016-11-18 /pmc/articles/PMC5115855/ /pubmed/27861637 http://dx.doi.org/10.1371/journal.pone.0167100 Text en © 2016 Ogasawara et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5115855/
https://www.ncbi.nlm.nih.gov/pubmed/27861637
http://dx.doi.org/10.1371/journal.pone.0167100