A balanced iterative random forest for gene selection from microarray data
BACKGROUND: The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind o...
Main authors: | Anaissi, Ali; Kennedy, Paul J; Goyal, Madhu; Catchpoole, Daniel R |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3766035/ https://www.ncbi.nlm.nih.gov/pubmed/23981907 http://dx.doi.org/10.1186/1471-2105-14-261 |
_version_ | 1782283449205260288 |
---|---|
author | Anaissi, Ali Kennedy, Paul J Goyal, Madhu Catchpoole, Daniel R |
author_facet | Anaissi, Ali Kennedy, Paul J Goyal, Madhu Catchpoole, Daniel R |
author_sort | Anaissi, Ali |
collection | PubMed |
description | BACKGROUND: The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease. RESULTS: This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied on four cancer microarray datasets: a childhood leukaemia dataset, which represents the main target of this paper, collected from The Children’s Hospital at Westmead, NCI 60, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The results of the BIRF approach outperform these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that a 7% ∼ 12% better accuracy is achieved by BIRF over MSVM-RFE with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative, or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists. CONCLUSION: The designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially the ability to handle the class-imbalanced data. Moreover, the analysis of the selected genes also provides a way to distinguish between the predictive genes and those that only appear to be predictive. |
format | Online Article Text |
id | pubmed-3766035 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3766035 2013-09-12 A balanced iterative random forest for gene selection from microarray data Anaissi, Ali Kennedy, Paul J Goyal, Madhu Catchpoole, Daniel R BMC Bioinformatics Methodology Article BACKGROUND: The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease. RESULTS: This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied on four cancer microarray datasets: a childhood leukaemia dataset, which represents the main target of this paper, collected from The Children’s Hospital at Westmead, NCI 60, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The results of the BIRF approach outperform these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that a 7% ∼ 12% better accuracy is achieved by BIRF over MSVM-RFE with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative, or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists. CONCLUSION: The designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially the ability to handle the class-imbalanced data. Moreover, the analysis of the selected genes also provides a way to distinguish between the predictive genes and those that only appear to be predictive. BioMed Central 2013-08-27 /pmc/articles/PMC3766035/ /pubmed/23981907 http://dx.doi.org/10.1186/1471-2105-14-261 Text en Copyright © 2013 Anaissi et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Anaissi, Ali Kennedy, Paul J Goyal, Madhu Catchpoole, Daniel R A balanced iterative random forest for gene selection from microarray data |
title | A balanced iterative random forest for gene selection from microarray data |
title_full | A balanced iterative random forest for gene selection from microarray data |
title_fullStr | A balanced iterative random forest for gene selection from microarray data |
title_full_unstemmed | A balanced iterative random forest for gene selection from microarray data |
title_short | A balanced iterative random forest for gene selection from microarray data |
title_sort | balanced iterative random forest for gene selection from microarray data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3766035/ https://www.ncbi.nlm.nih.gov/pubmed/23981907 http://dx.doi.org/10.1186/1471-2105-14-261 |
work_keys_str_mv | AT anaissiali abalancediterativerandomforestforgeneselectionfrommicroarraydata AT kennedypaulj abalancediterativerandomforestforgeneselectionfrommicroarraydata AT goyalmadhu abalancediterativerandomforestforgeneselectionfrommicroarraydata AT catchpooledanielr abalancediterativerandomforestforgeneselectionfrommicroarraydata AT anaissiali balancediterativerandomforestforgeneselectionfrommicroarraydata AT kennedypaulj balancediterativerandomforestforgeneselectionfrommicroarraydata AT goyalmadhu balancediterativerandomforestforgeneselectionfrommicroarraydata AT catchpooledanielr balancediterativerandomforestforgeneselectionfrommicroarraydata |
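
The description above mentions two ideas that a short sketch may make more concrete: drawing class-balanced samples from an imbalanced cohort, and checking how many top-ranked genes recur across repeated training runs. The Python below is only an illustration under stated assumptions, not the authors' BIRF implementation. The record does not say how BIRF balances classes, so the rule used here (sample each class with replacement down to the size of the smallest class) is an assumption, and the function names `balanced_bootstrap` and `shared_fraction`, along with the example labels and gene names, are hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return sample indices with the same number of draws from every class.

    ASSUMPTION: each class is sampled with replacement, and the number of
    draws per class equals the size of the smallest class. The catalog
    record does not specify the balancing rule used by BIRF.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choice(indices) for _ in range(n_per_class))
    return sample


def shared_fraction(ranked_lists, top_k):
    """Fraction of the top_k genes that appear in the top_k of every list."""
    top_sets = [set(ranking[:top_k]) for ranking in ranked_lists]
    shared = set.intersection(*top_sets)
    return len(shared) / top_k


# Hypothetical cohort: 8 patients in the major class, 2 in the minor class.
labels = ["major"] * 8 + ["minor"] * 2
sample = balanced_bootstrap(labels, random.Random(0))
print([labels[i] for i in sample])

# Hypothetical gene rankings from three repeated training runs.
runs = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g5"],
    ["g1", "g4", "g2", "g6"],
]
print(shared_fraction(runs, top_k=4))
```

In this toy data, three of the four top genes (g1, g2, g4) occur in all three lists, so the shared fraction is 0.75. The 64% figure in the description is the analogous quantity reported for the paper's own gene lists.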