MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction

Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of model...

Bibliographic Details
Main authors: Momeni, Zahra; Saniee Abadeh, Mohammad
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6947642/
https://www.ncbi.nlm.nih.gov/pubmed/31775313
http://dx.doi.org/10.3390/genes10120969
author Momeni, Zahra
Saniee Abadeh, Mohammad
collection PubMed
description Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG-sites. The algorithm first clusters the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA) based on the MapReduce paradigm (MR-based PGA) selects the age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets that relate to human blood tissue and contain the relevant age information. These datasets are combined into a single unioned set, which is in turn randomly divided into train and test sets with a ratio of 7:3, respectively. We build a Gradient Boosting Regressor (GBR) model on the CpG-sites selected from the train set. To evaluate the model accuracy, we compared our results with state-of-the-art approaches that used these datasets, and observed that our method performs better on the unseen test dataset, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R²) of 95.96% between age and DNAm. On the train data, the MAD and R² are 1.27 years and 99.27%, respectively. Finally, we evaluate our method in terms of the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on three computing machines with 32 processing cores each takes a total of only 58 min. This shows that our proposed algorithm is both efficient and scalable.
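The chromosome-evaluation step described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: it assumes a chromosome is a 0/1 mask over CpG-sites, uses Python's built-in `map` and `functools.reduce` in place of a real MapReduce cluster, and substitutes a hypothetical fitness (negative MAD of a one-variable least-squares fit on the mean of the selected sites) for whatever fitness function the paper actually uses. All function names are illustrative.

```python
import random
from functools import reduce


def mean_abs_dev(pred, true):
    """Mean absolute deviation between predicted and true ages."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def fitness(chromosome, rows, ages):
    """Hypothetical stand-in fitness: negative MAD of a least-squares line
    fitted to the mean methylation of the selected sites (higher is better)."""
    cols = [j for j, bit in enumerate(chromosome) if bit]
    if not cols:
        return float("-inf")  # an empty selection cannot predict anything
    x = [sum(row[j] for j in cols) / len(cols) for row in rows]
    n = len(x)
    mx, my = sum(x) / n, sum(ages) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, ages))
    slope = sxy / sxx if sxx else 0.0
    pred = [my + slope * (xi - mx) for xi in x]
    return -mean_abs_dev(pred, ages)


def map_step(chromosome, rows, ages):
    """Map: score one chromosome independently of all others."""
    return (fitness(chromosome, rows, ages), chromosome)


def reduce_step(best, candidate):
    """Reduce: keep whichever scored chromosome has the higher fitness."""
    return candidate if candidate[0] > best[0] else best


def evaluate_population(population, rows, ages):
    """Score every chromosome (map) and return the best pair (reduce)."""
    scored = map(lambda c: map_step(c, rows, ages), population)
    return reduce(reduce_step, scored)


def next_generation(population, rows, ages, rng, mutation_rate=0.05):
    """One GA generation: elitism, tournament selection, one-point
    crossover, and bit-flip mutation."""
    scored = [map_step(c, rows, ages) for c in population]
    elite = reduce(reduce_step, scored)[1]

    def pick():
        a, b = rng.sample(scored, 2)  # binary tournament
        return reduce_step(a, b)[1]

    children = [list(elite)]  # the best chromosome survives unchanged
    while len(children) < len(population):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        children.append([bit ^ (rng.random() < mutation_rate) for bit in child])
    return children
```

Because each `map_step` call depends only on its own chromosome, the calls could in principle be distributed across workers, which is the property a MapReduce formulation relies on. A toy call would look like `evaluate_population([[0, 1], [1, 0]], rows, ages)`, where `rows` holds one list of methylation values per sample.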
format Online
Article
Text
id pubmed-6947642
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6947642 2020-01-13 MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction Momeni, Zahra Saniee Abadeh, Mohammad Genes (Basel) Article MDPI 2019-11-25 /pmc/articles/PMC6947642/ /pubmed/31775313 http://dx.doi.org/10.3390/genes10120969 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction
topic Article