A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce
Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses various issues such as requirements of high processing time and large memory. This research focuses on the detection of single nucleotide polymorphism (SNP) in genome sequences. Currently, SNPs detect...
Main authors: | Tahir, Muhammad; Sardaraz, Muhammad |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/ https://www.ncbi.nlm.nih.gov/pubmed/32033366 http://dx.doi.org/10.3390/genes11020166 |
_version_ | 1783506813524639744 |
---|---|
author | Tahir, Muhammad Sardaraz, Muhammad |
author_facet | Tahir, Muhammad Sardaraz, Muhammad |
author_sort | Tahir, Muhammad |
collection | PubMed |
description | Next generation sequencing (NGS) technologies produce a huge amount of biological data, which raises issues such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with a Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed; the results are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% better F-score and a consistent accuracy of 99.80% on average. In addition, its mean accuracy is 0.21% higher in comparison. SNP mining has also been performed to identify specific regions in genome sequences. All the frameworks are run with the default memory management configuration. The observations show that all workflows have approximately the same memory requirements. In the future, we intend to display the mined SNPs graphically for user-friendly interaction, and to analyze and optimize the memory requirements. |
format | Online Article Text |
id | pubmed-7074349 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-7074349 2020-03-20 A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce Tahir, Muhammad Sardaraz, Muhammad Genes (Basel) Article MDPI 2020-02-05 /pmc/articles/PMC7074349/ /pubmed/32033366 http://dx.doi.org/10.3390/genes11020166 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Tahir, Muhammad Sardaraz, Muhammad A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce |
title | A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce |
title_sort | fast and scalable workflow for snps detection in genome sequences using hadoop map-reduce |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7074349/ https://www.ncbi.nlm.nih.gov/pubmed/32033366 http://dx.doi.org/10.3390/genes11020166 |
work_keys_str_mv | AT tahirmuhammad afastandscalableworkflowforsnpsdetectioningenomesequencesusinghadoopmapreduce AT sardarazmuhammad afastandscalableworkflowforsnpsdetectioningenomesequencesusinghadoopmapreduce AT tahirmuhammad fastandscalableworkflowforsnpsdetectioningenomesequencesusinghadoopmapreduce AT sardarazmuhammad fastandscalableworkflowforsnpsdetectioningenomesequencesusinghadoopmapreduce |
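
The description above mentions a Hadoop Map-Reduce based SNP caller that runs on aligned reads. As a rough illustration of that general idea only, the following is a minimal single-process sketch of map-reduce style SNP calling. It is not the authors' workflow, nor the Heap or Bowtie implementation. The read representation, the function names, and the `min_depth` and `min_fraction` thresholds are all assumptions made for this example; it also assumes gapless alignments that lie fully inside the reference and ignores base qualities.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical aligned read: (0-based start position on the reference, read bases).
AlignedRead = Tuple[int, str]
# Hypothetical SNP call: (position, reference base, alternate base).
SnpCall = Tuple[int, str, str]


def map_read(read: AlignedRead) -> Iterator[Tuple[int, str]]:
    """Map step: emit one (reference position, observed base) pair per read base."""
    start, bases = read
    for offset, base in enumerate(bases):
        yield start + offset, base


def shuffle(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Shuffle step: group observed bases by reference position."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for position, base in pairs:
        grouped[position].append(base)
    return grouped


def reduce_position(
    position: int,
    bases: List[str],
    reference: str,
    min_depth: int = 3,          # assumed threshold, chosen for illustration
    min_fraction: float = 0.8,   # assumed threshold, chosen for illustration
) -> Optional[SnpCall]:
    """Reduce step: call a SNP when most reads disagree with the reference base."""
    if len(bases) < min_depth:
        return None  # too few reads cover this position
    top_base, count = Counter(bases).most_common(1)[0]
    ref_base = reference[position]
    if top_base != ref_base and count / len(bases) >= min_fraction:
        return position, ref_base, top_base
    return None


def call_snps(reference: str, reads: Iterable[AlignedRead]) -> List[SnpCall]:
    """Run map, shuffle, and reduce in sequence and collect the SNP calls."""
    pairs = (pair for read in reads for pair in map_read(read))
    grouped = shuffle(pairs)
    calls: List[SnpCall] = []
    for position in sorted(grouped):
        call = reduce_position(position, grouped[position], reference)
        if call is not None:
            calls.append(call)
    return calls


if __name__ == "__main__":
    toy_reference = "ACGTACGT"
    toy_reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC")]
    print(call_snps(toy_reference, toy_reads))
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the framework would perform the grouping by key; here the `shuffle` function stands in for that step so the example stays self-contained.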