Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

Bibliographic Details
Main Authors: Yousef, Malik; Jung, Segun; Showe, Louise C; Showe, Michael K
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1877816/
https://www.ncbi.nlm.nih.gov/pubmed/17474999
http://dx.doi.org/10.1186/1471-2105-8-144
author Yousef, Malik
Jung, Segun
Showe, Louise C
Showe, Michael K
collection PubMed
description BACKGROUND: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes that uses recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures that use RFE. RESULTS: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters rather than individual genes improves supervised classification accuracy on the same data compared with SVM or Penalized Discriminant Analysis (PDA) combined with recursive feature elimination (SVM-RFE and PDA-RFE), which remove genes based on their individual discriminant weights. CONCLUSION: On complex microarray datasets, SVM-RCE provides better classification accuracy than SVM-RFE or PDA-RFE applied to the same datasets. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
format Text
id pubmed-1877816
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1877816 2007-05-25 Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data Yousef, Malik; Jung, Segun; Showe, Louise C; Showe, Michael K BMC Bioinformatics Research Article BioMed Central 2007-05-02 /pmc/articles/PMC1877816/ /pubmed/17474999 http://dx.doi.org/10.1186/1471-2105-8-144 Text en Copyright © 2007 Yousef et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data
topic Research Article
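
The recursive cluster elimination loop summarized in the description (cluster the genes, score each cluster by how well it classifies the samples, drop the weakest clusters, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, cluster scores are leave-one-out accuracies, and the function names and the parameters `n_clusters`, `final_clusters`, `drop`, and `seed` are hypothetical. Gene profiles are z-scored before K-means because the squared Euclidean distance between z-scored profiles decreases as their Pearson correlation increases, which approximates grouping by correlation.

```python
import random
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's profile; constant profiles map to zeros."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [mean(col) for col in zip(*members)]
    return assign


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`.

    Stand-in for the SVM scoring step; assumes >= 2 samples per class.
    """
    hits = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [[X[j][g] for g in genes]
                    for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [mean(col) for col in zip(*rows)]
        xi = [X[i][g] for g in genes]
        pred = min(centroids, key=lambda lab: sq_dist(xi, centroids[lab]))
        hits += pred == y[i]
    return hits / len(X)


def rce(X, y, n_clusters=8, final_clusters=2, drop=0.3, seed=0):
    """Return the surviving gene clusters (lists of column indices), best first.

    X is a list of samples (rows) by genes (columns); y holds class labels.
    """
    rng = random.Random(seed)
    genes = list(range(len(X[0])))
    while True:
        k = min(n_clusters, len(genes))
        profiles = [zscore([row[g] for row in X]) for g in genes]
        assign = kmeans(profiles, k, rng)
        clusters = [[g for g, a in zip(genes, assign) if a == c] for c in range(k)]
        clusters = [c for c in clusters if c]
        ranked = sorted(clusters, key=lambda c: loo_accuracy(X, y, c), reverse=True)
        if len(ranked) <= final_clusters:
            return ranked
        # Drop the lowest-scoring clusters, pool the survivors, and re-cluster.
        keep = max(final_clusters, int(len(ranked) * (1 - drop)))
        genes = sorted(g for c in ranked[:keep] for g in c)
        n_clusters = keep


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [0] * 6 + [1] * 6
    # Toy data: the first 4 of 12 genes are shifted in class 1, the rest are noise.
    data = [[rng.gauss(2.0 * lab if g < 4 else 0.0, 1.0) for g in range(12)]
            for lab in labels]
    print(rce(data, labels, n_clusters=4, final_clusters=2))
```

Swapping `loo_accuracy` for a cross-validated SVM score would bring the sketch closer to the method the abstract describes. The elimination loop itself does not depend on which classifier produces the cluster scores.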