Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

BACKGROUND: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to...

Bibliographic Details
Main authors: Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1569877/
https://www.ncbi.nlm.nih.gov/pubmed/16796748
http://dx.doi.org/10.1186/1471-2105-7-320
author Ooi, Chia Huey
Chetty, Madhu
Teng, Shyh Wei
author_sort Ooi, Chia Huey
collection PubMed
description BACKGROUND: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. RESULTS: We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. CONCLUSION: For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
format Text
id pubmed-1569877
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1569877 2006-10-02 BMC Bioinformatics Methodology Article BioMed Central 2006-06-23 /pmc/articles/PMC1569877/ /pubmed/16796748 http://dx.doi.org/10.1186/1471-2105-7-320 Text en Copyright © 2006 Ooi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data
topic Methodology Article