Feature selection and nearest centroid classification for protein mass spectrometry

BACKGROUND: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the s...

Bibliographic Details
Main author:	Levner, Ilya
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1274262/ https://www.ncbi.nlm.nih.gov/pubmed/15788095 http://dx.doi.org/10.1186/1471-2105-6-68

_version_	1782125975859888128
author	Levner, Ilya
collection	PubMed
description	BACKGROUND: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry. RESULTS: This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. The Student t-test, the Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting-based feature selection we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To fairly assess each algorithm, evaluation was done using stratified cross-validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets. CONCLUSION: This study tested a number of popular feature selection methods using the nearest centroid classifier and found that several reportedly state-of-the-art algorithms in fact perform rather poorly when tested via stratified cross-validation. The revealed inconsistencies provide clear evidence that algorithm evaluation should be performed on several data sets using a consistent (i.e., non-randomized, stratified) cross-validation procedure in order for the conclusions to be statistically sound.
format	Text
id	pubmed-1274262
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1274262 2005-10-29 Feature selection and nearest centroid classification for protein mass spectrometry Levner, Ilya BMC Bioinformatics Research Article BioMed Central 2005-03-23 /pmc/articles/PMC1274262/ /pubmed/15788095 http://dx.doi.org/10.1186/1471-2105-6-68 Text en Copyright © 2005 Levner; licensee BioMed Central Ltd.
title	Feature selection and nearest centroid classification for protein mass spectrometry
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1274262/ https://www.ncbi.nlm.nih.gov/pubmed/15788095 http://dx.doi.org/10.1186/1471-2105-6-68
