SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data

Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we...

Bibliographic Details
Main Authors: Chen, Yuxin, Chen, Yongsheng, Shi, Chunmei, Huang, Zhibo, Zhang, Yong, Li, Shengkang, Li, Yan, Ye, Jia, Yu, Chang, Li, Zhuo, Zhang, Xiuqing, Wang, Jian, Yang, Huanming, Fang, Lin, Chen, Qiang
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788068/
https://www.ncbi.nlm.nih.gov/pubmed/29220494
http://dx.doi.org/10.1093/gigascience/gix120
_version_ 1783296045780828160
author Chen, Yuxin
Chen, Yongsheng
Shi, Chunmei
Huang, Zhibo
Zhang, Yong
Li, Shengkang
Li, Yan
Ye, Jia
Yu, Chang
Li, Zhuo
Zhang, Xiuqing
Wang, Jian
Yang, Huanming
Fang, Lin
Chen, Qiang
author_sort Chen, Yuxin
collection PubMed
description Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we demonstrate SOAPnuke as a tool with abundant functions for a “QC-Preprocess-QC” workflow and MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order to avoid the necessity of reformatting different files when switching tools. Furthermore, the MapReduce framework enables large scalability to distribute all the processing works to an entire compute cluster. We conducted a benchmarking where SOAPnuke and other tools are used to preprocess a ∼30× NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.7 times the fastest speed of other tools.
format Online
Article
Text
id pubmed-5788068
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-57880682018-02-02 SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data Chen, Yuxin Chen, Yongsheng Shi, Chunmei Huang, Zhibo Zhang, Yong Li, Shengkang Li, Yan Ye, Jia Yu, Chang Li, Zhuo Zhang, Xiuqing Wang, Jian Yang, Huanming Fang, Lin Chen, Qiang Gigascience Technical Note Quality control (QC) and preprocessing are essential steps for sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution with integrated comprehensive functions, proper architectures, and highly scalable acceleration. In this article, we demonstrate SOAPnuke as a tool with abundant functions for a “QC-Preprocess-QC” workflow and MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order to avoid the necessity of reformatting different files when switching tools. Furthermore, the MapReduce framework enables large scalability to distribute all the processing works to an entire compute cluster. We conducted a benchmarking where SOAPnuke and other tools are used to preprocess a ∼30× NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.7 times the fastest speed of other tools. Oxford University Press 2017-12-04 /pmc/articles/PMC5788068/ /pubmed/29220494 http://dx.doi.org/10.1093/gigascience/gix120 Text en © The Author(s) 2017. Published by Oxford University Press. 
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Technical Note
Chen, Yuxin
Chen, Yongsheng
Shi, Chunmei
Huang, Zhibo
Zhang, Yong
Li, Shengkang
Li, Yan
Ye, Jia
Yu, Chang
Li, Zhuo
Zhang, Xiuqing
Wang, Jian
Yang, Huanming
Fang, Lin
Chen, Qiang
SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data
title SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data
title_sort soapnuke: a mapreduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788068/
https://www.ncbi.nlm.nih.gov/pubmed/29220494
http://dx.doi.org/10.1093/gigascience/gix120