A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses various issues such as requirements of high processing time and large memory. This research focuses on the detection of single nucleotide polymorphism (SNP) in genome sequences. Currently, SNPs detect...

Bibliographic Details
Main Authors:	Tahir, Muhammad, Sardaraz, Muhammad
Format:	Online Article Text
Language:	English
Published:	MDPI 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/ https://www.ncbi.nlm.nih.gov/pubmed/32033366 http://dx.doi.org/10.3390/genes11020166

author	Tahir, Muhammad Sardaraz, Muhammad
collection	PubMed
description	Next-generation sequencing (NGS) technologies produce huge amounts of biological data, which raises issues such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed, and the results are compared with the Bowtie and BWA aligners in the alignment phase and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. On average, the proposed framework achieved a 22.46% higher F-score and 99.80% consistent accuracy; comparatively, a 0.21% higher mean accuracy is achieved. Moreover, SNP mining has also been performed to identify specific regions in genome sequences. All the frameworks are implemented with the default memory management configuration. The observations show that all workflows have approximately the same memory requirements. In the future, we intend to show the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
format	Online Article Text
id	pubmed-7074349
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7074349 2020-03-20 A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce Tahir, Muhammad Sardaraz, Muhammad Genes (Basel) Article MDPI 2020-02-05 /pmc/articles/PMC7074349/ /pubmed/32033366 http://dx.doi.org/10.3390/genes11020166 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/ https://www.ncbi.nlm.nih.gov/pubmed/32033366 http://dx.doi.org/10.3390/genes11020166
