
Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting


Bibliographic Details
Main Authors: MacNell, Nathaniel, Feinstein, Lydia, Wilkerson, Jesse, Salo, Päivi M., Molsberry, Samantha A., Fessler, Michael B., Thorne, Peter S., Motsinger-Reif, Alison A., Zeldin, Darryl C.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9838837/
https://www.ncbi.nlm.nih.gov/pubmed/36638125
http://dx.doi.org/10.1371/journal.pone.0280387
_version_ 1784869361581817856
author MacNell, Nathaniel
Feinstein, Lydia
Wilkerson, Jesse
Salo, Päivi M.
Molsberry, Samantha A.
Fessler, Michael B.
Thorne, Peter S.
Motsinger-Reif, Alison A.
Zeldin, Darryl C.
author_facet MacNell, Nathaniel
Feinstein, Lydia
Wilkerson, Jesse
Salo, Päivi M.
Molsberry, Samantha A.
Fessler, Michael B.
Thorne, Peter S.
Motsinger-Reif, Alison A.
Zeldin, Darryl C.
author_sort MacNell, Nathaniel
collection PubMed
description Despite the prominent use of complex survey data and the growing popularity of machine learning methods in epidemiologic research, few machine learning software implementations offer options for handling complex samples. A major challenge impeding the broader incorporation of machine learning into epidemiologic research is incomplete guidance for analyzing complex survey data, including the importance of sampling weights for valid prediction in target populations. Using data from 15,820 participants in the 1988–1994 National Health and Nutrition Examination Survey cohort, we determined whether ignoring weights in gradient boosting models of all-cause mortality affected prediction, as measured by the F1 score and corresponding 95% confidence intervals. In simulations, we additionally assessed the impact of sample size, weight variability, predictor strength, and model dimensionality. In the National Health and Nutrition Examination Survey data, unweighted model performance was inflated compared to the weighted model (F1 score 81.9% [95% confidence interval: 81.2%, 82.7%] vs 77.4% [95% confidence interval: 76.1%, 78.6%]). However, the error was mitigated if the F1 score was subsequently recalculated with observed outcomes from the weighted dataset (F1: 77.0%; 95% confidence interval: 75.7%, 78.4%). In simulations, this finding held in the largest sample size (N = 10,000) under all analytic conditions assessed. For sample sizes <5,000, sampling weights had little impact in simulations that more closely resembled a simple random sample (low weight variability) or in models with strong predictors, but findings were inconsistent under other analytic scenarios. Failing to account for sampling weights in gradient boosting models may limit generalizability for data from complex surveys, dependent on sample size and other analytic properties. In the absence of software for configuring weighted algorithms, post-hoc re-calculations of unweighted model performance using weighted observed outcomes may more accurately reflect model prediction in target populations than ignoring weights entirely.
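
The post-hoc correction the abstract describes, keeping an unweighted model's predictions but recomputing the F1 score against observed outcomes with the sampling weights applied, can be sketched in a few lines. The snippet below is an illustrative sketch only, not code from the article: it assumes scikit-learn (whose GradientBoostingClassifier and f1_score both accept a sample_weight argument) and uses synthetic data with hypothetical lognormal weights standing in for real NHANES design weights.

```python
# Illustrative sketch (assumes scikit-learn): compare an unweighted gradient
# boosting model to one fit with sampling weights, then recompute the
# unweighted model's F1 score using the weights, as the abstract describes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for survey data; real NHANES weights would come from
# the survey design files, not a random draw.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
weights = rng.lognormal(mean=0.0, sigma=0.7, size=len(y))  # skewed, survey-like

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, weights, test_size=0.3, random_state=0
)

# Unweighted model: ignores the survey design entirely.
unweighted = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Weighted model: sample_weight makes each observation count in proportion
# to the population it represents.
weighted = GradientBoostingClassifier(random_state=0).fit(
    X_tr, y_tr, sample_weight=w_tr
)

# Naive evaluation: unweighted model, unweighted F1.
f1_naive = f1_score(y_te, unweighted.predict(X_te))

# Post-hoc fix from the abstract: same unweighted predictions, but F1 is
# recalculated against observed outcomes using the sampling weights.
f1_posthoc = f1_score(y_te, unweighted.predict(X_te), sample_weight=w_te)

# Fully weighted pipeline for comparison.
f1_weighted = f1_score(y_te, weighted.predict(X_te), sample_weight=w_te)

print(f"unweighted model, unweighted F1:        {f1_naive:.3f}")
print(f"unweighted model, weighted F1 (post-hoc): {f1_posthoc:.3f}")
print(f"weighted model, weighted F1:            {f1_weighted:.3f}")
```

Under the abstract's finding, the post-hoc weighted F1 should track the fully weighted pipeline's F1 more closely than the naive unweighted score does, particularly at larger sample sizes.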
format Online
Article
Text
id pubmed-9838837
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9838837 2023-01-14 Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting MacNell, Nathaniel Feinstein, Lydia Wilkerson, Jesse Salo, Päivi M. Molsberry, Samantha A. Fessler, Michael B. Thorne, Peter S. Motsinger-Reif, Alison A. Zeldin, Darryl C. PLoS One Research Article Despite the prominent use of complex survey data and the growing popularity of machine learning methods in epidemiologic research, few machine learning software implementations offer options for handling complex samples. A major challenge impeding the broader incorporation of machine learning into epidemiologic research is incomplete guidance for analyzing complex survey data, including the importance of sampling weights for valid prediction in target populations. Using data from 15,820 participants in the 1988–1994 National Health and Nutrition Examination Survey cohort, we determined whether ignoring weights in gradient boosting models of all-cause mortality affected prediction, as measured by the F1 score and corresponding 95% confidence intervals. In simulations, we additionally assessed the impact of sample size, weight variability, predictor strength, and model dimensionality. In the National Health and Nutrition Examination Survey data, unweighted model performance was inflated compared to the weighted model (F1 score 81.9% [95% confidence interval: 81.2%, 82.7%] vs 77.4% [95% confidence interval: 76.1%, 78.6%]). However, the error was mitigated if the F1 score was subsequently recalculated with observed outcomes from the weighted dataset (F1: 77.0%; 95% confidence interval: 75.7%, 78.4%). In simulations, this finding held in the largest sample size (N = 10,000) under all analytic conditions assessed. For sample sizes <5,000, sampling weights had little impact in simulations that more closely resembled a simple random sample (low weight variability) or in models with strong predictors, but findings were inconsistent under other analytic scenarios. Failing to account for sampling weights in gradient boosting models may limit generalizability for data from complex surveys, dependent on sample size and other analytic properties. In the absence of software for configuring weighted algorithms, post-hoc re-calculations of unweighted model performance using weighted observed outcomes may more accurately reflect model prediction in target populations than ignoring weights entirely. Public Library of Science 2023-01-13 /pmc/articles/PMC9838837/ /pubmed/36638125 http://dx.doi.org/10.1371/journal.pone.0280387 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
spellingShingle Research Article
MacNell, Nathaniel
Feinstein, Lydia
Wilkerson, Jesse
Salo, Päivi M.
Molsberry, Samantha A.
Fessler, Michael B.
Thorne, Peter S.
Motsinger-Reif, Alison A.
Zeldin, Darryl C.
Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title_full Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title_fullStr Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title_full_unstemmed Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title_short Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting
title_sort implementing machine learning methods with complex survey data: lessons learned on the impacts of accounting sampling weights in gradient boosting
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9838837/
https://www.ncbi.nlm.nih.gov/pubmed/36638125
http://dx.doi.org/10.1371/journal.pone.0280387
work_keys_str_mv AT macnellnathaniel implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT feinsteinlydia implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT wilkersonjesse implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT salopäivim implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT molsberrysamanthaa implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT fesslermichaelb implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT thornepeters implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT motsingerreifalisona implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting
AT zeldindarrylc implementingmachinelearningmethodswithcomplexsurveydatalessonslearnedontheimpactsofaccountingsamplingweightsingradientboosting