SVM-RFE: selection and visualization of the most relevant features through non-linear kernels

BACKGROUND: Support vector machines (SVM) are a powerful tool to analyze data with a number of predictors approximately equal to or larger than the number of observations. However, the application of SVM to biomedical data was originally limited because SVM was not designed to evaluate importance...

Bibliographic Details
Main authors: Sanz, Hector; Valim, Clarissa; Vegas, Esteban; Oller, Josep M.; Reverter, Ferran
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6245920/
https://www.ncbi.nlm.nih.gov/pubmed/30453885
http://dx.doi.org/10.1186/s12859-018-2451-4
author Sanz, Hector
Valim, Clarissa
Vegas, Esteban
Oller, Josep M.
Reverter, Ferran
collection PubMed
description BACKGROUND: Support vector machines (SVM) are a powerful tool for analyzing data in which the number of predictors is approximately equal to or larger than the number of observations. However, the application of SVM to biomedical data was originally limited because SVM was not designed to evaluate the importance of predictor variables. Creating predictive models based on only the most relevant variables is essential in biomedical research. Substantial work has been done to allow assessment of variable importance in SVM models, but this work has focused on SVM implemented with linear kernels. The power of SVM as a prediction model is associated with the flexibility generated by the use of non-linear kernels. Moreover, SVM has been extended to model survival outcomes. This paper extends the Recursive Feature Elimination (RFE) algorithm by proposing three approaches to rank variables based on non-linear SVM and SVM for survival analysis. RESULTS: The proposed algorithms allow visualization of each of the RFE iterations and, hence, identification of the most relevant predictors of the response variable. Using simulation studies based on time-to-event outcomes and three real datasets, we evaluate the three methods, which are based on pseudo-samples and kernel principal component analysis, and compare them with the original SVM-RFE algorithm for non-linear kernels. In the simulation studies, where the variable ranks produced by each algorithm were compared with the truly most relevant variables, the three proposed algorithms generally performed better than the gold-standard RFE for non-linear kernels. Generally, the RFE-pseudo-samples approach outperformed the other three methods, even when variables were assumed to be correlated in all tested scenarios. CONCLUSIONS: The proposed approaches can be implemented with accuracy to select variables and assess the direction and strength of associations in the analysis of biomedical data using SVM for categorical or time-to-event responses. Variable selection and interpretation of the direction and strength of associations between predictors and outcomes can be conducted accurately with the proposed approaches, particularly the RFE-pseudo-samples approach, when analyzing biomedical data. These approaches perform better than the classical RFE of Guyon in realistic scenarios about the structure of biomedical data. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2451-4) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6245920
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6245920 2018-11-26 SVM-RFE: selection and visualization of the most relevant features through non-linear kernels Sanz, Hector Valim, Clarissa Vegas, Esteban Oller, Josep M. Reverter, Ferran BMC Bioinformatics Methodology Article BioMed Central 2018-11-19 /pmc/articles/PMC6245920/ /pubmed/30453885 http://dx.doi.org/10.1186/s12859-018-2451-4 Text en © The Author(s). 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title SVM-RFE: selection and visualization of the most relevant features through non-linear kernels
topic Methodology Article