Random KNN feature selection - a fast and stable alternative to Random Forests

Bibliographic Details
Main Authors: Li, Shengqiao, Harner, E James, Adjeroh, Donald A
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects: Methodology Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281073/
https://www.ncbi.nlm.nih.gov/pubmed/22093447
http://dx.doi.org/10.1186/1471-2105-12-450
_version_ 1782223914343071744
author Li, Shengqiao
Harner, E James
Adjeroh, Donald A
author_facet Li, Shengqiao
Harner, E James
Adjeroh, Donald A
author_sort Li, Shengqiao
collection PubMed
description BACKGROUND: Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p" problems. However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs. RESULTS: We present RKNN-FS, an innovative feature selection procedure for "small n, large p" problems. RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray datasets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large-scale problems involving thousands of variables and multiple classes. CONCLUSIONS: Given Random KNN's superior classification performance compared with Random Forests, and RKNN-FS's simplicity, ease of implementation, and superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets.
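The description above gives enough detail to sketch the mechanics: an ensemble of k-NN base models, each built on a random subset of the variables, with variable importance read off from a "support" criterion. The following Python/numpy sketch is illustrative only, not the authors' implementation; the class and parameter names (RandomKNN, n_models, m, k) are assumptions made for this sketch, as is the reading of support used here (the mean held-out accuracy of the base models whose random subsets contain a given feature).

import numpy as np
from collections import Counter

class RandomKNN:
    """Ensemble of k-nearest-neighbor base models, each restricted to a
    random subset of m input variables; class prediction is by majority
    vote over the base models."""

    def __init__(self, n_models=500, m=None, k=1, seed=0):
        self.n_models = n_models  # number of base KNN models
        self.m = m                # features per base model (default: sqrt(p))
        self.k = k                # neighbors per base model
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        p = X.shape[1]
        m = self.m or max(1, int(np.sqrt(p)))
        # Each base model sees only a random subset of the p variables.
        self.subsets_ = [self.rng.choice(p, size=m, replace=False)
                         for _ in range(self.n_models)]
        self.X_, self.y_ = X, y
        return self

    def _base_predict(self, cols, q):
        # Plain k-NN (Euclidean distance) on one feature subset.
        d = np.linalg.norm(self.X_[:, cols] - q[cols], axis=1)
        nn = np.argpartition(d, self.k - 1)[:self.k]
        return Counter(self.y_[nn]).most_common(1)[0][0]

    def predict(self, Xnew):
        Xnew = np.asarray(Xnew, dtype=float)
        out = []
        for q in Xnew:
            votes = Counter(self._base_predict(cols, q) for cols in self.subsets_)
            out.append(votes.most_common(1)[0][0])
        return np.array(out)

    def feature_support(self, X_val, y_val):
        # Assumed reading of "support": average, per feature, the held-out
        # accuracy of the base models whose subsets contain that feature.
        X_val, y_val = np.asarray(X_val, dtype=float), np.asarray(y_val)
        p = self.X_.shape[1]
        sums, counts = np.zeros(p), np.zeros(p)
        for cols in self.subsets_:
            pred = np.array([self._base_predict(cols, q) for q in X_val])
            acc = float(np.mean(pred == y_val))
            sums[cols] += acc
            counts[cols] += 1
        return np.divide(sums, counts, out=np.zeros(p), where=counts > 0)

In RKNN-FS, a support score of this kind drives the two-stage backward selection: features are ranked by support, the lowest-ranked portion is dropped, and the ensemble is refit until a compact subset remains; that outer loop is omitted from the sketch.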
format Online
Article
Text
id pubmed-3281073
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3281073 2012-02-17 Random KNN feature selection - a fast and stable alternative to Random Forests Li, Shengqiao Harner, E James Adjeroh, Donald A BMC Bioinformatics Methodology Article BACKGROUND: Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p" problems. However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs. RESULTS: We present RKNN-FS, an innovative feature selection procedure for "small n, large p" problems. RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray datasets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large-scale problems involving thousands of variables and multiple classes. CONCLUSIONS: Given Random KNN's superior classification performance compared with Random Forests, and RKNN-FS's simplicity, ease of implementation, and superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets. BioMed Central 2011-11-18 /pmc/articles/PMC3281073/ /pubmed/22093447 http://dx.doi.org/10.1186/1471-2105-12-450 Text en Copyright ©2011 Li et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Li, Shengqiao
Harner, E James
Adjeroh, Donald A
Random KNN feature selection - a fast and stable alternative to Random Forests
title Random KNN feature selection - a fast and stable alternative to Random Forests
title_full Random KNN feature selection - a fast and stable alternative to Random Forests
title_fullStr Random KNN feature selection - a fast and stable alternative to Random Forests
title_full_unstemmed Random KNN feature selection - a fast and stable alternative to Random Forests
title_short Random KNN feature selection - a fast and stable alternative to Random Forests
title_sort random knn feature selection - a fast and stable alternative to random forests
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281073/
https://www.ncbi.nlm.nih.gov/pubmed/22093447
http://dx.doi.org/10.1186/1471-2105-12-450
work_keys_str_mv AT lishengqiao randomknnfeatureselectionafastandstablealternativetorandomforests
AT harnerejames randomknnfeatureselectionafastandstablealternativetorandomforests
AT adjerohdonalda randomknnfeatureselectionafastandstablealternativetorandomforests