
Risk estimation using probability machines

BACKGROUND: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effec...


Bibliographic Details
Main authors: Dasgupta, Abhijit, Szymczak, Silke, Moore, Jason H, Bailey-Wilson, Joan E, Malley, James D
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015350/
https://www.ncbi.nlm.nih.gov/pubmed/24581306
http://dx.doi.org/10.1186/1756-0381-7-2
author Dasgupta, Abhijit
Szymczak, Silke
Moore, Jason H
Bailey-Wilson, Joan E
Malley, James D
collection PubMed
description BACKGROUND: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios. RESULTS: We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented. CONCLUSIONS: The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, will share properties from the statistical machine that it is derived from.
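As a rough illustration of the approach the abstract describes, the sketch below fits a regression random forest to a 0/1 outcome so that its predictions estimate conditional probabilities, then derives a counterfactual effect size by predicting for every subject with a binary predictor set to 1 and again set to 0. It assumes scikit-learn's `RandomForestRegressor` as the probability machine. The simulated logistic data, its coefficients, the hyperparameters, and all function names are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def simulate_logistic(n, seed=0):
    """Simulate data from a logistic model (coefficients are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    x1 = rng.integers(0, 2, size=n)          # binary exposure
    x2 = rng.normal(size=n)                  # continuous covariate
    logit = -0.5 + 1.0 * x1 + 0.8 * x2
    p = 1.0 / (1.0 + np.exp(-logit))         # true conditional probability
    y = rng.binomial(1, p)                   # observed binary outcome
    X = np.column_stack([x1, x2]).astype(float)
    return X, y, p


def fit_probability_machine(X, y, seed=0):
    """Fit a regression forest to the 0/1 outcome; predictions estimate P(y=1 | x)."""
    forest = RandomForestRegressor(
        n_estimators=200,
        min_samples_leaf=25,   # assumed setting: larger leaves give smoother estimates
        random_state=seed,
    )
    forest.fit(X, y)
    return forest


def counterfactual_risk_difference(model, X, column):
    """Average predicted risk with X[:, column] set to 1 minus with it set to 0."""
    X_exposed = X.copy()
    X_exposed[:, column] = 1.0
    X_unexposed = X.copy()
    X_unexposed[:, column] = 0.0
    return float(np.mean(model.predict(X_exposed) - model.predict(X_unexposed)))


X, y, p_true = simulate_logistic(2000, seed=1)
machine = fit_probability_machine(X, y, seed=1)
p_hat = machine.predict(X)

print("mean absolute error of probability estimates:",
      float(np.mean(np.abs(p_hat - p_true))))
print("estimated risk difference for x1:",
      counterfactual_risk_difference(machine, X, column=0))
```

The effect size here is a risk difference rather than an odds ratio; that is one possible choice for this sketch, and other contrasts could be computed from the same two sets of counterfactual predictions.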
format Online
Article
Text
id pubmed-4015350
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4015350 2014-05-23 Risk estimation using probability machines. Dasgupta, Abhijit; Szymczak, Silke; Moore, Jason H; Bailey-Wilson, Joan E; Malley, James D. BioData Min. Methodology. BioMed Central 2014-03-01. /pmc/articles/PMC4015350/ /pubmed/24581306 http://dx.doi.org/10.1186/1756-0381-7-2 Text en. Copyright © 2014 Dasgupta et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
title Risk estimation using probability machines
topic Methodology