Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia

BACKGROUND: Automatic stepwise subset selection methods in linear regression often perform poorly, both in terms of variable selection and estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet, stepwise algorith...

Bibliographic Details
Main Authors: Morozova, Olga; Levina, Olga; Uusküla, Anneli; Heimer, Robert
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4553217/
https://www.ncbi.nlm.nih.gov/pubmed/26319135
http://dx.doi.org/10.1186/s12874-015-0066-2
author Morozova, Olga
Levina, Olga
Uusküla, Anneli
Heimer, Robert
collection PubMed
description BACKGROUND: Automatic stepwise subset selection methods in linear regression often perform poorly, both in terms of variable selection and estimation of coefficients and standard errors, especially when the number of independent variables is large and multicollinearity is present. Yet stepwise algorithms remain the dominant method in medical and epidemiological research. METHODS: The performance of stepwise methods (backward elimination and forward selection algorithms using AIC, BIC, and the likelihood ratio test (LRT) at p = 0.05) and of alternative subset selection methods in linear regression, including Bayesian model averaging (BMA) and penalized regression (lasso, adaptive lasso, and adaptive elastic net), was investigated in a dataset from a cross-sectional study of drug users in St. Petersburg, Russia, in 2012–2013. The dependent variable measured health-related quality of life, and the independent correlates included 44 variables measuring demographic, behavioral, and structural factors. RESULTS: In our case study, all methods returned models of different size and composition, varying from 41 to 11 variables. The percentage of significant variables among those selected in the final model varied from 100 % to 27 %. Model selection with stepwise methods was highly unstable, with most (and all in the case of backward elimination: BIC, forward selection: BIC, and backward elimination: LRT) of the selected variables being significant (the 95 % confidence interval for the coefficient did not include zero). Adaptive elastic net demonstrated improved stability and more conservative estimates of coefficients and standard errors compared to stepwise methods. By incorporating model uncertainty into subset selection and estimation of coefficients and their standard deviations, BMA returned a parsimonious model with the most conservative results in terms of covariate significance. CONCLUSIONS: BMA and adaptive elastic net performed best in our analysis. Based on our results and previous theoretical studies, stepwise methods in medical and epidemiological research may be outperformed by alternative methods in cases such as ours. In situations of high uncertainty it is beneficial to apply different methodologically sound subset selection methods and to explore where their outputs do and do not agree. We recommend that researchers, at a minimum, explore model uncertainty and stability as part of their analyses and report these details in epidemiological papers. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12874-015-0066-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4553217
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4553217 2015-08-31 BMC Med Res Methodol Research Article BioMed Central 2015-08-30 /pmc/articles/PMC4553217/ /pubmed/26319135 http://dx.doi.org/10.1186/s12874-015-0066-2 Text en © Morozova et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Comparison of subset selection methods in linear regression in the context of health-related quality of life and substance abuse in Russia
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4553217/
https://www.ncbi.nlm.nih.gov/pubmed/26319135
http://dx.doi.org/10.1186/s12874-015-0066-2