
Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-di...


Bibliographic Details
Main Authors:	Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
Format:	Online Article Text
Language:	English
Published:	Hindawi Publishing Corporation 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4387916/ https://www.ncbi.nlm.nih.gov/pubmed/25879059 http://dx.doi.org/10.1155/2015/471371

collection	PubMed
description	Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
format	Online Article Text
id	pubmed-4387916
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Hindawi Publishing Corporation
record_format	MEDLINE/PubMed
spelling	pubmed-4387916 2015-04-15 ScientificWorldJournal Research Article Hindawi Publishing Corporation 2015 2015-03-24 /pmc/articles/PMC4387916/ /pubmed/25879059 http://dx.doi.org/10.1155/2015/471371 Text en Copyright © 2015 Thanh-Tung Nguyen et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Similar Items