An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples
BACKGROUND: Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic...
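Main Authors: | Bose, Shilpi, Das, Chandra, Banerjee, Abhik, Ghosh, Kuntal, Chattopadhyay, Matangini, Chattopadhyay, Samiran, Barik, Aishwarya |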
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2021 |
Subjects: | Bioinformatics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459790/ https://www.ncbi.nlm.nih.gov/pubmed/34616883 http://dx.doi.org/10.7717/peerj-cs.671 |
_version_ | 1784571601053810688 |
---|---|
author | Bose, Shilpi Das, Chandra Banerjee, Abhik Ghosh, Kuntal Chattopadhyay, Matangini Chattopadhyay, Samiran Barik, Aishwarya |
author_facet | Bose, Shilpi Das, Chandra Banerjee, Abhik Ghosh, Kuntal Chattopadhyay, Matangini Chattopadhyay, Samiran Barik, Aishwarya |
author_sort | Bose, Shilpi |
collection | PubMed |
description | BACKGROUND: Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS: In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed that can handle the class imbalance problem and the high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data, where an oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then, for every sub-dataset, a base classifier is constructed separately, and finally, the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. Also, a number of the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets.
RESULTS: To assess the performance of the proposed MFSAC-EC model, it is applied to different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness with respect to other models. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better than that of other well-known existing models. Apart from that, it has also been found that the proposed model can identify many important attributes/biomarker genes. |
format | Online Article Text |
id | pubmed-8459790 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-84597902021-10-05 An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples Bose, Shilpi Das, Chandra Banerjee, Abhik Ghosh, Kuntal Chattopadhyay, Matangini Chattopadhyay, Samiran Barik, Aishwarya PeerJ Comput Sci Bioinformatics BACKGROUND: Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS: In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed which can handle class imbalance problem and high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data where the oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then for every sub-dataset, a base classifier is constructed separately, and finally, the predictive accuracy of these base classifiers is combined using the majority voting technique forming the MFSAC-based ensemble classifier. 
Also, a number of most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS: To assess the performance of the proposed MFSAC-EC model, it is applied on different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness with respect to other models. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better compared to other well-known existing models. Apart from that, it has been also found that the proposed model can identify many important attributes/biomarker genes. PeerJ Inc. 2021-09-16 /pmc/articles/PMC8459790/ /pubmed/34616883 http://dx.doi.org/10.7717/peerj-cs.671 Text en © 2021 Bose et al. https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Bose, Shilpi Das, Chandra Banerjee, Abhik Ghosh, Kuntal Chattopadhyay, Matangini Chattopadhyay, Samiran Barik, Aishwarya An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title | An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title_full | An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title_fullStr | An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title_full_unstemmed | An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title_short | An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
title_sort | ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459790/ https://www.ncbi.nlm.nih.gov/pubmed/34616883 http://dx.doi.org/10.7717/peerj-cs.671 |
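
The METHODS portion of the description field outlines a pipeline: class-balanced bootstrapping, per-sample feature selection, one base classifier per sub-dataset, majority voting, and ranking attributes by how often they are selected. The following is a minimal sketch of that overall flow only, not the authors' implementation. The MFSAC filter and clustering step is not specified in this record, so a simple class-mean-spread filter stands in for it, and a nearest-centroid rule stands in for the base classifier. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(X, y, rng):
    """Bootstrap sample in which every class is drawn with replacement
    up to the size of the largest class (simple random oversampling)."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for _ in range(target):
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb


def class_means(X, y, features):
    """Per-class mean of the given feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += row[j]
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def select_features(X, y, k):
    """Stand-in filter (NOT MFSAC): keep the k features whose class
    means are furthest apart."""
    all_feats = list(range(len(X[0])))
    means = class_means(X, y, all_feats)

    def spread(j):
        vals = [m[j] for m in means.values()]
        return max(vals) - min(vals)

    return sorted(all_feats, key=spread, reverse=True)[:k]


def fit_ensemble(X, y, n_models=15, k=1, seed=0):
    """One (feature subset, class centroids) pair per bootstrapped dataset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        Xb, yb = balanced_bootstrap(X, y, rng)
        feats = select_features(Xb, yb, k)
        models.append((feats, class_means(Xb, yb, feats)))
    return models


def predict(models, row):
    """Majority vote over the nearest-centroid base classifiers."""
    votes = Counter()
    for feats, centroids in models:
        def dist(c):
            return sum((row[j] - m) ** 2 for j, m in zip(feats, centroids[c]))
        votes[min(centroids, key=dist)] += 1
    return votes.most_common(1)[0][0]


def feature_frequency(models):
    """How often each feature index was selected across sub-datasets."""
    return Counter(j for feats, _ in models for j in feats)


# Toy, imbalanced data (made up): feature 0 separates the classes.
X = [[5.0, 0.1, 0.3], [5.2, 0.4, 0.1], [4.9, 0.2, 0.2], [5.1, 0.3, 0.4],
     [1.0, 0.2, 0.3], [1.2, 0.3, 0.1]]
y = ["tumor", "tumor", "tumor", "tumor", "normal", "normal"]

models = fit_ensemble(X, y, n_models=15, k=1)
print(predict(models, [5.0, 0.2, 0.2]))
print(feature_frequency(models).most_common())
```

An odd number of base classifiers is used here so that a two-class vote cannot tie; the record does not say how the authors sized or tie-broke their ensemble.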