SlimPLS: A Method for Feature Selection in Gene Expression-Based Disease Classification

A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set of labeled samples from the two populations, and then using...

Bibliographic Details
Main Authors: Gutkin, Michael, Shamir, Ron, Dror, Gideon
Format: Text
Language: English
Published: Public Library of Science 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2715895/
https://www.ncbi.nlm.nih.gov/pubmed/19649265
http://dx.doi.org/10.1371/journal.pone.0006416
author Gutkin, Michael
Shamir, Ron
Dror, Gideon
collection PubMed
description A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set of labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.
format Text
id pubmed-2715895
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2715895 2009-08-01 PLoS One Research Article Public Library of Science 2009-07-29 /pmc/articles/PMC2715895/ /pubmed/19649265 http://dx.doi.org/10.1371/journal.pone.0006416 Text en Gutkin et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title SlimPLS: A Method for Feature Selection in Gene Expression-Based Disease Classification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2715895/
https://www.ncbi.nlm.nih.gov/pubmed/19649265
http://dx.doi.org/10.1371/journal.pone.0006416