MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of model...
Main authors: | Momeni, Zahra; Saniee Abadeh, Mohammad |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6947642/ https://www.ncbi.nlm.nih.gov/pubmed/31775313 http://dx.doi.org/10.3390/genes10120969 |
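
The abstract for this record reports a run time of 4123 min without parallelization and 58 min on 3 machines with 32 processing cores each. One way to read those figures is as speedup and parallel efficiency. The sketch below is only that arithmetic; the variable names are chosen here for illustration, and the efficiency figure assumes that the serial baseline ran on a single core and that all 96 cores were in use, neither of which the abstract states.

```python
# Timings and hardware as reported in the abstract of this record.
serial_minutes = 4123
parallel_minutes = 58
machines = 3
cores_per_machine = 32

# Assumption: every core on every machine took part in the parallel run.
total_cores = machines * cores_per_machine

# Speedup is the ratio of serial to parallel wall-clock time.
speedup = serial_minutes / parallel_minutes

# Parallel efficiency is speedup divided by the number of cores used.
efficiency = speedup / total_cores

print(f"cores={total_cores} speedup={speedup:.1f}x efficiency={efficiency:.2f}")
```

Under those assumptions the reported timings correspond to a speedup of roughly 71x on 96 cores, or about 74% parallel efficiency.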
_version_ | 1783485598684676096 |
---|---|
author | Momeni, Zahra; Saniee Abadeh, Mohammad |
author_facet | Momeni, Zahra; Saniee Abadeh, Mohammad |
author_sort | Momeni, Zahra |
collection | PubMed |
description | Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG-sites. The algorithm first clusters the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA) based on the MapReduce paradigm (MR-based PGA) selects the age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets that relate to human blood tissue and contain the relevant age information. These datasets are combined into a single unioned set, which is then randomly divided into training and test sets at a ratio of 7:3. We build a Gradient Boosting Regressor (GBR) model on the CpG-sites selected from the training set. To evaluate model accuracy, we compared our results with state-of-the-art approaches that used these datasets and observed that our method performs better on the unseen test set, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R²) of 95.96% between age and DNAm. On the training data, the MAD and R² are 1.27 years and 99.27%, respectively. Finally, we evaluate the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines with 32 processing cores each takes a total of only 58 min. This shows that our proposed algorithm is both efficient and scalable. |
format | Online Article Text |
id | pubmed-6947642 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-6947642 2020-01-13 MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction Momeni, Zahra; Saniee Abadeh, Mohammad Genes (Basel) Article MDPI 2019-11-25 /pmc/articles/PMC6947642/ /pubmed/31775313 http://dx.doi.org/10.3390/genes10120969 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
title | MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6947642/ https://www.ncbi.nlm.nih.gov/pubmed/31775313 http://dx.doi.org/10.3390/genes10120969 |
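
The description in this record outlines a genetic algorithm whose chromosome evaluation follows the MapReduce pattern. The following is a minimal, single-process sketch of that general pattern, not the authors' implementation: the toy `fitness` function, the synthetic data, the parameter values, and every function name are assumptions made here for illustration. It leaves out the age-range clustering stage and the Gradient Boosting Regressor, and the map step runs in-process instead of on a cluster.

```python
import random
from functools import reduce


def fitness(chromosome, rows, ages):
    """Toy fitness (an assumption): negative mean absolute error of a
    predictor that averages the selected features and rescales to years."""
    selected = [i for i, bit in enumerate(chromosome) if bit]
    if not selected:
        return float("-inf")  # an empty feature subset cannot predict anything
    errors = []
    for row, age in zip(rows, ages):
        predicted = 100.0 * sum(row[i] for i in selected) / len(selected)
        errors.append(abs(predicted - age))
    return -sum(errors) / len(errors)


def map_step(population, rows, ages):
    # Map: score every chromosome independently. In a MapReduce deployment
    # these calls would be spread across workers; here they run in-process.
    return [(fitness(c, rows, ages), c) for c in population]


def reduce_step(scored):
    # Reduce: fold the (fitness, chromosome) pairs down to the single best one.
    return reduce(lambda a, b: a if a[0] >= b[0] else b, scored)


def evolve(rows, ages, n_features, pop_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = None
    for _ in range(generations):
        scored = map_step(population, rows, ages)
        generation_best = reduce_step(scored)
        if best is None or generation_best[0] > best[0]:
            best = generation_best
        # Selection: the fitter half of the population become parents.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        parents = [c for _, c in scored[: pop_size // 2]]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n_features)     # flip exactly one bit per child
            child[flip] = 1 - child[flip]
            children.append(child)
        population = children
    return best


# Synthetic data (an assumption): feature 0 tracks age, the rest are noise.
data_rng = random.Random(1)
ages = [data_rng.uniform(10, 90) for _ in range(30)]
rows = [[age / 100.0] + [data_rng.random() for _ in range(5)] for age in ages]

best_fitness, best_mask = evolve(rows, ages, n_features=6)
print(best_fitness, best_mask)
```

Because each chromosome is scored independently of the others, the map step is the natural place to distribute work, which matches the task-parallel evaluation of chromosomes that the description mentions.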