
A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses challenges such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Currently, SNPs detect...

Bibliographic Details
Main authors: Tahir, Muhammad; Sardaraz, Muhammad
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/
https://www.ncbi.nlm.nih.gov/pubmed/32033366
http://dx.doi.org/10.3390/genes11020166
_version_ 1783506813524639744
author Tahir, Muhammad
Sardaraz, Muhammad
author_facet Tahir, Muhammad
Sardaraz, Muhammad
author_sort Tahir, Muhammad
collection PubMed
description Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses challenges such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Currently, SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed, and the results are compared with the Bowtie and BWA aligners in the alignment phase and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% improvement in F-score and a consistent accuracy of 99.80% on average. In addition, a 0.21% higher mean accuracy is achieved in comparison. Moreover, SNP mining has also been performed to identify specific regions in genome sequences. All the frameworks are implemented with the default memory management configuration. The observations show that all workflows have approximately the same memory requirement. In the future, it is intended to show the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
format Online
Article
Text
id pubmed-7074349
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7074349 2020-03-20 Genes (Basel) Article
MDPI 2020-02-05 /pmc/articles/PMC7074349/ /pubmed/32033366 http://dx.doi.org/10.3390/genes11020166 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce
title_sort fast and scalable workflow for snps detection in genome sequences using hadoop map-reduce
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/
https://www.ncbi.nlm.nih.gov/pubmed/32033366
http://dx.doi.org/10.3390/genes11020166