
Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes

BACKGROUND: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multicl...


Bibliographic Details
Main Authors: Jirapech-Umpai, Thanyaluk; Aitken, Stuart
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1181625/
https://www.ncbi.nlm.nih.gov/pubmed/15958165
http://dx.doi.org/10.1186/1471-2105-6-148
_version_ 1782124629967503360
author Jirapech-Umpai, Thanyaluk
Aitken, Stuart
collection PubMed
description BACKGROUND: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems where three tumour types and nine cell lines, respectively, must be identified. We apply an evolutionary algorithm to identify the near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step, whereby the most informative genes are selected from the genes assayed. RESULTS: In the absence of feature selection, classification accuracy on the training data is typically good, but is not replicated on the testing data. Gene selection using the RankGene software [3] is shown to significantly improve performance on the testing data. Further, we show that the choice of feature selection criteria can have a significant effect on accuracy. The evolutionary algorithm is shown to perform stably across the space of possible parameter settings, indicating the robustness of the approach. We assess performance using a low variance estimation technique, and present an analysis of the genes most often selected as predictors. CONCLUSION: The computational methods we have developed perform robustly and accurately, and yield results in accord with clinical knowledge: a Z-score analysis of the genes most frequently selected identifies genes known to discriminate AML and Pre-T ALL leukemia. This study also confirms that significantly different sets of genes are found to be most discriminatory as the sample classes are refined.
format Text
id pubmed-1181625
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1181625 2005-07-30 Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes Jirapech-Umpai, Thanyaluk Aitken, Stuart BMC Bioinformatics Research Article
BioMed Central 2005-06-15 /pmc/articles/PMC1181625/ /pubmed/15958165 http://dx.doi.org/10.1186/1471-2105-6-148 Text en Copyright © 2005 Jirapech-Umpai and Aitken; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes
topic Research Article