Cargando…

Performance of variable selection methods using stability-based selection

BACKGROUND: Variable selection is frequently carried out during the analysis of many types of high-dimensional data, including those in metabolomics. This study compared the predictive performance of four variable selection methods using stability-based selection, a new secondary selection method th...

Descripción completa

Detalles Bibliográficos
Autores principales: Lu, Danny, Weljie, Aalim, de Leon, Alexander R., McConnell, Yarrow, Bathe, Oliver F., Kopciuk, Karen
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379604/
https://www.ncbi.nlm.nih.gov/pubmed/28376847
http://dx.doi.org/10.1186/s13104-017-2461-8
Descripción
Sumario:BACKGROUND: Variable selection is frequently carried out during the analysis of many types of high-dimensional data, including those in metabolomics. This study compared the predictive performance of four variable selection methods using stability-based selection, a new secondary selection method that is implemented in the R package BioMark. Two of these methods were evaluated using the more well-known false discovery rate (FDR) as well. RESULTS: Simulation studies varied factors relevant to biological data studies, with results based on the median values of 200 partial area under the receiver operating characteristic curve. There was no single top performing method across all factor settings, but the student t test based on stability selection or with FDR adjustment and the variable importance in projection (VIP) scores from partial least squares regression models obtained using a stability-based approach tended to perform well in most settings. Similar results were found with a real spiked-in metabolomics dataset. Group sample size, group effect size, number of significant variables and correlation structure were the most important factors whereas the percentage of significant variables was the least important. CONCLUSIONS: Researchers can improve prediction scores for their study data by choosing VIP scores based on stability variable selection over the other approaches when the number of variables is small to modest and by increasing the number of samples even moderately. When the number of variables is high and there is block correlation amongst the significant variables (i.e., true biomarkers), the FDR-adjusted student t test performed best. The R package BioMark is an easy-to-use open-source program for variable selection that had excellent performance characteristics for the purposes of this study.