CVtreeMLE: Efficient Estimation of Mixed Exposures using Data Adaptive Decision Trees and Cross-Validated Targeted Maximum Likelihood Estimation in R

Bibliographic Details
Main authors: McCoy, David; Hubbard, Alan; Van der Laan, Mark
Format: Online Article Text
Language: English
Published: 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312067/
https://www.ncbi.nlm.nih.gov/pubmed/37398941
http://dx.doi.org/10.21105/joss.04181
collection PubMed
description Statistical causal inference of mixed exposures has been limited by reliance on parametric models and, until recently, by researchers considering only one exposure at a time, usually estimated as a beta coefficient in a generalized linear regression model (GLM). This independent assessment of exposures poorly estimates the joint impact of a collection of the same exposures in a realistic exposure setting. Marginal methods for mixture variable selection such as ridge/lasso regression are biased by linear assumptions, and the interactions modeled are chosen by the user. Clustering methods such as principal component regression lose both interpretability and valid inference. Newer mixture methods such as quantile g-computation (Keil et al., 2020) are biased by linear/additive assumptions. More flexible methods such as Bayesian kernel machine regression (BKMR) (Bobb et al., 2014) are sensitive to the choice of tuning parameters, are computationally taxing, and lack an interpretable and robust summary statistic of dose-response relationships. No method currently exists that finds the best flexible model to adjust for covariates while applying a non-parametric model that targets interactions in a mixture and delivers valid inference for a target parameter. Non-parametric methods such as decision trees are a useful tool to evaluate combined exposures by finding partitions in the joint-exposure (mixture) space that best explain the variance in an outcome. However, current methods that use decision trees to make statistical inference about interactions are biased and prone to overfitting because they use the full data both to identify nodes in the tree and to make statistical inference given these nodes. Other methods derive inference from an independent test set and therefore do not use the full data.
The CVtreeMLE R package provides researchers in (bio)statistics, epidemiology, and environmental health sciences with access to state-of-the-art statistical methodology for evaluating the causal effects of a data-adaptively determined mixed exposure using decision trees. Our target audience is analysts who would normally use a potentially biased GLM-based model for a mixed exposure. Instead, we hope to provide users with a non-parametric statistical machine in which users simply specify the exposures, covariates, and outcome; CVtreeMLE then determines whether a best-fitting decision tree exists and delivers interpretable results.
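The sample-splitting idea in the description (find the partition on one part of the data, estimate its effect on another part, and rotate so every observation is used) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the CVtreeMLE algorithm or its API: it is written in Python rather than R, uses a single simulated exposure, replaces the decision tree with a one-split threshold search, and replaces targeted maximum likelihood estimation with an unadjusted difference in means. The function names `best_split` and `cross_fit_threshold_effect` and all simulation settings are hypothetical.

```python
import numpy as np


def best_split(x, y, min_leaf=20):
    """Return the threshold on x that minimises the within-group squared error of y.

    Candidate thresholds are quantiles of x; returns None if no candidate
    leaves at least `min_leaf` observations on each side.
    """
    best_thr, best_sse = None, np.inf
    for thr in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left, right = y[x <= thr], y[x > thr]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thr, best_sse = thr, sse
    return best_thr


def cross_fit_threshold_effect(x, y, n_folds=5, seed=0):
    """For each fold, learn the threshold on the other folds and estimate the
    outcome difference (above minus below the threshold) on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds  # balanced random fold labels
    results = []
    for k in range(n_folds):
        train, valid = folds != k, folds == k
        thr = best_split(x[train], y[train])  # partition found on training folds only
        if thr is None:
            continue
        above = x[valid] > thr
        if above.sum() == 0 or (~above).sum() == 0:
            continue
        # Unadjusted difference in means on held-out data (stand-in for TMLE).
        diff = y[valid][above].mean() - y[valid][~above].mean()
        results.append((thr, diff))
    return results


# Simulated data (an assumption for illustration): the outcome jumps by 2
# when the exposure exceeds 0.5.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 1000)
y = 2.0 * (x > 0.5) + rng.normal(0.0, 1.0, 1000)

results = cross_fit_threshold_effect(x, y)
for thr, diff in results:
    print(f"threshold={thr:.2f}  held-out difference={diff:.2f}")
```

Because the threshold in each fold is chosen without looking at the held-out observations, the held-out difference is not inflated by the search over candidate splits, yet every observation still contributes to estimation in exactly one fold.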
format Online
Article
Text
id pubmed-10312067
institution National Center for Biotechnology Information
language English
publishDate 2023
record_format MEDLINE/PubMed
spelling pubmed-10312067 2023-06-30 J Open Source Softw Article 2023 2023-02-21 /pmc/articles/PMC10312067/ /pubmed/37398941 http://dx.doi.org/10.21105/joss.04181 Text en https://creativecommons.org/licenses/by/4.0/ License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/).
topic Article