A balanced iterative random forest for gene selection from microarray data

Bibliographic Details
Main Authors: Anaissi, Ali; Kennedy, Paul J; Goyal, Madhu; Catchpoole, Daniel R
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3766035/
https://www.ncbi.nlm.nih.gov/pubmed/23981907
http://dx.doi.org/10.1186/1471-2105-14-261
author Anaissi, Ali
Kennedy, Paul J
Goyal, Madhu
Catchpoole, Daniel R
collection PubMed
description BACKGROUND: The wealth of gene expression values generated by high-throughput microarray technologies leads to complex, high-dimensional datasets. Moreover, many cohorts have imbalanced classes, where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease. RESULTS: This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. BIRF is applied to four cancer microarray datasets: a childhood leukaemia dataset collected from The Children’s Hospital at Westmead, which is the main target of this paper; NCI 60; a colon dataset; and a lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. BIRF outperforms these state-of-the-art methods, especially on imbalanced datasets. Experiments on the childhood leukaemia dataset show that BIRF achieves 7% to 12% better accuracy than MSVM-RFE, together with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating the training experiments three times to see whether they are globally informative or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists. CONCLUSION: The BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis of the selected genes provides a way to distinguish between genes that are predictive and those that only appear to be predictive.
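The abstract names two ideas without giving their details: drawing class-balanced samples from an imbalanced cohort, and checking how consistently the top genes reappear across repeated runs. The Python sketch below illustrates both under stated assumptions. It is not the authors' BIRF implementation: the balancing rule (sampling every class with replacement down to the size of the smallest class), the overlap measure, the function names, and all labels and gene names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Draw a bootstrap sample with the same number of cases from every class.

    Assumption: each class is sampled with replacement, and the per-class
    sample size equals the size of the smallest class. The abstract does
    not state the exact scheme used by BIRF.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_per_class))
    return sample


def consistent_fraction(gene_lists):
    """Fraction of genes in the first list that appear in every list."""
    first = set(gene_lists[0])
    shared = first.intersection(*gene_lists[1:])
    return len(shared) / len(first)


# Hypothetical cohort: 8 patients in class "A", 2 in the minor class "B".
labels = ["A"] * 8 + ["B"] * 2
sample = balanced_bootstrap(labels, random.Random(0))
counts = {c: sum(labels[i] == c for i in sample) for c in set(labels)}

# Hypothetical top-gene lists from three repeated training runs.
runs = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g7"],
    ["g1", "g4", "g2", "g9"],
]
overlap = consistent_fraction(runs)
```

In this toy setup each class contributes two sampled indices, so a tree trained on `sample` would see the minor class as often as the major one. `overlap` plays the role of the consistency figure reported in the abstract, computed here on made-up lists.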
format Online
Article
Text
id pubmed-3766035
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3766035 2013-09-12 A balanced iterative random forest for gene selection from microarray data Anaissi, Ali; Kennedy, Paul J; Goyal, Madhu; Catchpoole, Daniel R BMC Bioinformatics Methodology Article BioMed Central 2013-08-27 /pmc/articles/PMC3766035/ /pubmed/23981907 http://dx.doi.org/10.1186/1471-2105-14-261 Text en Copyright © 2013 Anaissi et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A balanced iterative random forest for gene selection from microarray data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3766035/
https://www.ncbi.nlm.nih.gov/pubmed/23981907
http://dx.doi.org/10.1186/1471-2105-14-261