
Deep Ensemble Machine Learning Framework for the Estimation of PM2.5 Concentrations

Bibliographic Details
Main authors: Yu, Wenhua, Li, Shanshan, Ye, Tingting, Xu, Rongbin, Song, Jiangning, Guo, Yuming
Format: Online Article Text
Language: English
Published: Environmental Health Perspectives 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8901043/
https://www.ncbi.nlm.nih.gov/pubmed/35254864
http://dx.doi.org/10.1289/EHP9752
Description
Summary:
BACKGROUND: Accurate estimation of historical PM2.5 (particulate matter with an aerodynamic diameter of less than 2.5 µm) is essential for environmental health risk assessment.
OBJECTIVES: The aim of this study was to develop a multiple-level stacked ensemble machine learning framework for improving the estimation of daily ground-level PM2.5 concentrations.
METHODS: A deep ensemble machine learning framework (DEML) was developed to estimate daily PM2.5 concentrations. The framework has a three-stage structure. At the first stage, four base models [gradient boosting machine (GBM), support vector machine (SVM), random forest (RF), and eXtreme gradient boosting (XGBoost)] were used to generate a new data set of PM2.5 concentrations for training the next-stage learners. At the second stage, three meta-models [RF, XGBoost, and generalized linear model (GLM)] were used to estimate PM2.5 concentrations from a combination of the original data set and the predictions of the first-stage models. At the third stage, a nonnegative least squares (NNLS) algorithm was employed to obtain the optimal weights for combining the second-stage estimates. As a case study, DEML was applied to data from 133 monitoring stations in Italy to predict daily PM2.5 at each [Formula: see text] grid cell across Italy from 2015 to 2019. Model performance was evaluated with 10-fold cross-validation (CV) and compared with that of five benchmark algorithms [GBM, SVM, RF, XGBoost, and Super Learner (SL)].
RESULTS: The PM2.5 prediction performance of DEML [coefficient of determination [Formula: see text] and root mean square error [Formula: see text]] was superior to that of every benchmark model (R² of 0.51, 0.76, 0.83, 0.70, and 0.83 for GBM, SVM, RF, XGBoost, and SL, respectively). DEML displayed reliable performance in capturing the spatiotemporal variations of PM2.5 in Italy.
DISCUSSION: The proposed DEML framework performed well in PM2.5 estimation and could be used as a tool for more accurate environmental exposure assessment. https://doi.org/10.1289/EHP9752
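
The third stage and the evaluation metrics named in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that cross-validated predictions from two earlier-stage models are already available, uses made-up toy numbers, and swaps in a simple cyclic coordinate descent as the nonnegative least squares solver. All function and variable names (`nnls_weights`, `blend`, `model_a`, `model_b`) are hypothetical.

```python
import math


def nnls_weights(preds, y, n_iter=10000, tol=1e-12):
    """Nonnegative least squares weights via cyclic coordinate descent.

    preds: one list of predictions per model; y: observed values.
    Returns one weight >= 0 per model (an illustrative stand-in solver).
    """
    w = [0.0] * len(preds)
    resid = list(y)  # residual y - sum_j w[j] * preds[j]; all weights start at 0
    for _ in range(n_iter):
        biggest_change = 0.0
        for j, col in enumerate(preds):
            norm = sum(p * p for p in col)
            if norm == 0.0:
                continue
            # Best unconstrained update for weight j, then clip at zero.
            step = sum(p * r for p, r in zip(col, resid)) / norm
            new_w = max(0.0, w[j] + step)
            delta = new_w - w[j]
            if delta != 0.0:
                resid = [r - delta * p for r, p in zip(resid, col)]
                w[j] = new_w
            biggest_change = max(biggest_change, abs(delta))
        if biggest_change < tol:
            break
    return w


def blend(preds, w):
    """Weighted sum of the model predictions."""
    return [sum(wj * col[i] for wj, col in zip(w, preds))
            for i in range(len(preds[0]))]


def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def r2(y, yhat):
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy data (made up): observed daily values and two biased model outputs.
y = [10.0, 12.0, 15.0, 20.0, 25.0, 18.0]
model_a = [v + 2.0 for v in y]  # systematically too high
model_b = [v - 2.0 for v in y]  # systematically too low

weights = nnls_weights([model_a, model_b], y)
blended = blend([model_a, model_b], weights)
print("weights:", [round(v, 4) for v in weights])
print("RMSE model_a:", rmse(y, model_a), "RMSE blend:", round(rmse(y, blended), 6))
print("R2 blend:", round(r2(y, blended), 6))
```

In this toy case the two models err in opposite directions, so the nonnegative weights settle near 0.5 each and the blend outperforms either model alone. With real data the weights would be fitted on out-of-fold predictions so that the combination is not tuned on the same observations used to train the earlier stages.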