Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers
Classification of different cancer types is an essential step in designing a decision support model for early cancer predictions. Using various machine learning (ML) techniques with ensemble learning is one such method used for classifications. In the present study, various ML algorithms were explor...
Main Authors: | Pasha Syed, Abdu Rehaman; Anbalagan, Rahul; Setlur, Anagha S.; Karunakaran, Chandrashekar; Shetty, Jyoti; Kumar, Jitendra; Niranjan, Vidya |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9675216/ https://www.ncbi.nlm.nih.gov/pubmed/36401182 http://dx.doi.org/10.1186/s12859-022-05050-w |
_version_ | 1784833323476975616 |
---|---|
author | Pasha Syed, Abdu Rehaman; Anbalagan, Rahul; Setlur, Anagha S.; Karunakaran, Chandrashekar; Shetty, Jyoti; Kumar, Jitendra; Niranjan, Vidya |
author_sort | Pasha Syed, Abdu Rehaman |
collection | PubMed |
description | Classification of different cancer types is an essential step in designing a decision support model for early cancer predictions. Using various machine learning (ML) techniques with ensemble learning is one such method used for classifications. In the present study, various ML algorithms were explored on twenty exome datasets belonging to 5 cancer types. Initially, a data clean-up was carried out on 4181 variants of cancer with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory dataset analysis using principal component analysis was then performed in 1D and 2D axes to reduce the high dimensionality of the data. To significantly reduce the imbalance in the derivative dataset, oversampling was carried out using SMOTE. Further, classification algorithms such as K-nearest neighbour (KNN) and support vector machine (SVM) were used initially on the oversampled dataset. A 4-layer artificial neural network model with 1D batch normalization was also designed to improve the model accuracy. Ensemble ML techniques such as bagging were then applied, using KNN, SVM and MLPs as base classifiers, to improve the weighted average performance metrics of the model. However, due to the small sample size, model improvement was challenging. Therefore, a novel method to augment the sample size using a generative adversarial network (GAN) and a triplet-based variational autoencoder (TVAE) was employed that reconstructed the features and labels, generating the data. The results showed that, on initial scrutiny, KNN achieved a weighted average of 0.74 and SVM 0.76. Oversampling ensured that the accuracy on the derivative dataset improved significantly, and the ensemble classifier raised the accuracy to 82.91% when the data was divided in a 70:15:15 ratio (training, test and holdout datasets). The overall evaluation metric value when GAN and TVAE increased the sample size was found to be 0.92, with an overall comparison model of 0.66. Therefore, the present study designed an effective model for classifying cancers which, when implemented on real-world samples, will play a major role in early cancer diagnosis. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-05050-w. |
format | Online Article Text |
id | pubmed-9675216 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9675216 2022-11-20 BMC Bioinformatics Research BioMed Central 2022-11-18 /pmc/articles/PMC9675216/ /pubmed/36401182 http://dx.doi.org/10.1186/s12859-022-05050-w Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9675216/ https://www.ncbi.nlm.nih.gov/pubmed/36401182 http://dx.doi.org/10.1186/s12859-022-05050-w |
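The description in this record mentions three steps that are easy to picture in code: a 70:15:15 split into training, test and holdout sets, oversampling of the minority classes, and a weighted average performance metric. The block below is a minimal sketch of those ideas in plain Python, not the authors' implementation. The function names (`split_70_15_15`, `smote_like_oversample`, `weighted_f1`) are hypothetical, the oversampler is a simplified SMOTE-style interpolation rather than a library SMOTE, and the choice of F1 as the weighted metric is an assumption, since the record does not say which metric was averaged.

```python
import random
from collections import Counter


def split_70_15_15(rows, seed=0):
    """Shuffle rows and cut them into training, test and holdout parts (70:15:15)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_test = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # remainder, about 15%
    return train, test, holdout


def smote_like_oversample(rows, k=3, seed=0):
    """Balance classes with SMOTE-style interpolation (simplified, assumed variant).

    Each synthetic point lies on the segment between an original minority
    sample and one of its k nearest neighbours from the same class.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    target = max(len(samples) for samples in by_label.values())
    balanced = list(rows)
    for label, samples in by_label.items():
        if len(samples) < 2:
            continue  # interpolation needs at least two original points
        for _ in range(target - len(samples)):
            base = rng.choice(samples)
            others = [s for s in samples if s is not base]
            # Sort same-class samples by squared Euclidean distance to base.
            others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
            mate = rng.choice(others[:k])
            gap = rng.random()
            synthetic = [a + gap * (b - a) for a, b in zip(base, mate)]
            balanced.append((synthetic, label))
    return balanced


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)
```

Each row is assumed to be a `(features, label)` pair with a list of floats as features. A common precaution, which this sketch leaves to the caller, is to oversample only the training part so that synthetic points never leak into the test or holdout sets.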