
A machine learning approach to small area estimation: predicting the health, housing and well-being of the population of Netherlands


Bibliographic Details
Main authors: Viljanen, Markus, Meijerink, Lotta, Zwakhals, Laurens, van de Kassteele, Jan
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects: Research
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9169293/
https://www.ncbi.nlm.nih.gov/pubmed/35668432
http://dx.doi.org/10.1186/s12942-022-00304-5
author Viljanen, Markus
Meijerink, Lotta
Zwakhals, Laurens
van de Kassteele, Jan
collection PubMed
description BACKGROUND: Local policymakers require information about public health, housing and well-being for small geographical areas. A municipality can, for example, use this information to organize targeted activities with the aim of improving the well-being of its residents. Surveys are often used to gather data, but many neighborhoods have only a few or even zero respondents. In that case, estimating the status of the local population directly from survey responses is likely to be unreliable. METHODS: Small Area Estimation (SAE) is a technique for providing estimates at small geographical levels with only a few or even zero respondents. In classical individual-level SAE, a complex statistical regression model is fitted to the survey responses by using auxiliary administrative data for the population as predictors; the missing responses are then predicted and aggregated to the desired geographical level. In this paper we compare gradient boosted trees (XGBoost), a well-known machine learning technique, to a structured additive regression model (STAR) designed for the specific problem of estimating public health and well-being in the whole population of the Netherlands. RESULTS: We compare the accuracy and performance of these models using out-of-sample predictions with five-fold Cross Validation (5CV). We do this for three data sets of different sample sizes and outcome types. Compared to the STAR model, gradient boosted trees improve both the accuracy of the predictions and the total time taken to obtain them. Even though the models appear quite similar in overall accuracy, the small area predictions at neighborhood level sometimes differ significantly. It may therefore make sense to pursue slightly more accurate models for better predictions in small areas. However, one of the biggest benefits is that XGBoost does not require prior knowledge or model specification. Data preparation and modelling are much easier, since the method automatically handles missing data, non-linear responses and interactions, and accounts for spatial correlation structures. CONCLUSIONS: In this paper we provide new nationwide estimates of health, housing and well-being indicators at neighborhood level in the Netherlands (see 'Online materials'). We demonstrate that machine learning provides a good alternative to complex statistical regression modelling for small area estimation in terms of accuracy, robustness, speed and data preparation. These results can be used to make appropriate policy decisions at a local level and to make recommendations about which estimation methods are beneficial in terms of accuracy, time and budget constraints.
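The individual-level SAE procedure summarized in the METHODS part of the description (fit a model on respondents, predict the missing responses from auxiliary data, aggregate per area) can be illustrated with a minimal sketch. This is not the authors' implementation: a one-predictor least-squares line stands in for both STAR and XGBoost, and the field names `area`, `x` and `y`, the helper names and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x.

    Assumption: this simple line is only a stand-in for the STAR or
    XGBoost models named in the abstract.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def small_area_estimates(population):
    """Return the mean outcome per area.

    population: list of dicts with hypothetical keys 'area', 'x'
    (auxiliary predictor known for everyone) and 'y' (survey response,
    or None for people who did not respond).
    """
    respondents = [p for p in population if p["y"] is not None]
    a, b = fit_line([p["x"] for p in respondents],
                    [p["y"] for p in respondents])
    by_area = defaultdict(list)
    for p in population:
        # Keep observed responses; predict the missing ones from the model.
        value = p["y"] if p["y"] is not None else a + b * p["x"]
        by_area[p["area"]].append(value)
    # Aggregate individual values to the desired geographical level.
    return {area: mean(values) for area, values in by_area.items()}


# Toy data: area C has zero respondents but still receives an estimate.
population = [
    {"area": "A", "x": 1.0, "y": 2.0},
    {"area": "A", "x": 2.0, "y": 4.0},
    {"area": "A", "x": 3.0, "y": None},
    {"area": "B", "x": 4.0, "y": 8.0},
    {"area": "C", "x": 5.0, "y": None},
]
estimates = small_area_estimates(population)
```

In this toy setup the estimate for area C rests entirely on model predictions, which is the situation the abstract describes for neighborhoods with zero respondents.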
format Online
Article
Text
id pubmed-9169293
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9169293 2022-06-07 Int J Health Geogr Research BioMed Central 2022-06-06 /pmc/articles/PMC9169293/ /pubmed/35668432 http://dx.doi.org/10.1186/s12942-022-00304-5 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A machine learning approach to small area estimation: predicting the health, housing and well-being of the population of Netherlands
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9169293/
https://www.ncbi.nlm.nih.gov/pubmed/35668432
http://dx.doi.org/10.1186/s12942-022-00304-5