Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach
BACKGROUND: The malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environment...
Main authors: | Tai, Kah Yee; Dhaliwal, Jasbir; Wong, KokSheik |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9358850/ https://www.ncbi.nlm.nih.gov/pubmed/35934714 http://dx.doi.org/10.1186/s12859-022-04870-0 |
_version_ | 1784764017186701312 |
---|---|
author | Tai, Kah Yee Dhaliwal, Jasbir Wong, KokSheik |
author_facet | Tai, Kah Yee Dhaliwal, Jasbir Wong, KokSheik |
author_sort | Tai, Kah Yee |
collection | PubMed |
description | BACKGROUND: The malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS). RESULTS: We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variables. Moreover, to compare the performance of the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction. CONCLUSIONS: Our proposed approach identified SNP rs334 as the most contributing feature with an importance score of 6.224 compared to the baseline, with an importance score of 1.1314. This is an important result as prior studies have proven that rs334 is a major genetic risk factor for malaria. The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than wGRS alone, in which LightGBM obtained the best performance (0.0033 MAE score). SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04870-0. |
format | Online Article Text |
id | pubmed-9358850 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9358850 2022-08-10 Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach Tai, Kah Yee Dhaliwal, Jasbir Wong, KokSheik BMC Bioinformatics Research BACKGROUND: The malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS). RESULTS: We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variables. Moreover, to compare the performance of the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction. CONCLUSIONS: Our proposed approach identified SNP rs334 as the most contributing feature with an importance score of 6.224 compared to the baseline, with an importance score of 1.1314. This is an important result as prior studies have proven that rs334 is a major genetic risk factor for malaria. The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than wGRS alone, in which LightGBM obtained the best performance (0.0033 MAE score). SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04870-0. BioMed Central 2022-08-07 /pmc/articles/PMC9358850/ /pubmed/35934714 http://dx.doi.org/10.1186/s12859-022-04870-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Tai, Kah Yee Dhaliwal, Jasbir Wong, KokSheik Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach |
title | Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9358850/ https://www.ncbi.nlm.nih.gov/pubmed/35934714 http://dx.doi.org/10.1186/s12859-022-04870-0 |
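
The abstract in this record describes risk scores built from the cumulative effects of SNPs (wGRS) and reports model quality as Mean Absolute Error (MAE), but the record itself gives no formulas. The sketch below is one common way to express both ideas, not the authors' implementation: it assumes each SNP is weighted by the natural log of its odds ratio and multiplied by the number of risk alleles carried (0, 1, or 2). The SNP names `snp_a` and `snp_b`, the odds ratios, and the function names are hypothetical and chosen only for illustration.

```python
import math


def weighted_genetic_risk_score(allele_counts, odds_ratios):
    """Return sum(count_i * ln(OR_i)) over all SNPs.

    Assumption (not taken from the record): weights are log odds ratios
    and counts are risk-allele counts of 0, 1, or 2.
    """
    if allele_counts.keys() != odds_ratios.keys():
        raise ValueError("allele_counts and odds_ratios must cover the same SNPs")
    return sum(
        count * math.log(odds_ratios[snp])
        for snp, count in allele_counts.items()
    )


def mean_absolute_error(y_true, y_pred):
    """Return the average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, made up for illustration only.
odds_ratios = {"snp_a": 0.5, "snp_b": 2.0}   # OR < 1 is protective, OR > 1 raises risk
individual = {"snp_a": 1, "snp_b": 2}        # risk-allele counts for one person

score = weighted_genetic_risk_score(individual, odds_ratios)
error = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Under these assumptions a protective allele (odds ratio below 1) lowers the score and a risk allele raises it, and a lower MAE means the predicted scores sit closer to the computed ones.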