Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of prediction model parsimony in integrated empirical geographic regression

National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empiri...

Bibliographic Details
Main Authors: Kim, Sun-Young; Bechle, Matthew; Hankey, Steve; Sheppard, Lianne; Szpiro, Adam A.; Marshall, Julian D.
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7028280/
https://www.ncbi.nlm.nih.gov/pubmed/32069301
http://dx.doi.org/10.1371/journal.pone.0228535
_version_ 1783498994342690816
author Kim, Sun-Young
Bechle, Matthew
Hankey, Steve
Sheppard, Lianne
Szpiro, Adam A.
Marshall, Julian D.
collection PubMed
description National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants during 1979–2015; (2) explore systematically how the number of variables selected for inclusion in a model affects model performance; and (3) provide publicly available model predictions. We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979–2015. We also use ~350 geographic characteristics at each location, including measures of traffic, land use, land cover, and satellite-based estimates of air pollution. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of geographic variables. For all pollutants and years, we compare three approaches for choosing variables to include in the PLS model: (1) no variables, (2) a limited number of variables selected from the full set by forward selection, and (3) all variables. We evaluate model performance by 10-fold cross-validation (CV) with conventional and spatially clustered test data. Models using 3 to 30 variables selected from the full set generally have the best performance across all pollutants and years (median R² for conventional [clustered] CV: 0.66 [0.47]) compared to models with no variables (0.37 [0]) or all variables (0.64 [0.27]). Concentration estimates for all Census Blocks reveal generally decreasing concentrations over several decades with local heterogeneity. Our findings suggest that national prediction models can be built by empirically selecting only a small number of important variables to provide robust concentration estimates. Model estimates are freely available online.
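
The two cross-validation schemes named in the description can be illustrated with a minimal sketch. This is not the authors' IEG model: ordinary least squares stands in for PLS with universal kriging, the data are synthetic, and the spatially clustered folds are built as contiguous longitude bands, which is only one possible way to form spatially clustered test sets. All function and variable names are hypothetical.

```python
import numpy as np


def cv_r2(X, y, folds):
    """Cross-validated R^2: each fold is predicted by a model fit on the other folds."""
    pred = np.empty_like(y)
    for k in np.unique(folds):
        test = folds == k
        # Ordinary least squares with an intercept (stand-in for PLS + kriging).
        A_train = np.column_stack([np.ones((~test).sum()), X[~test]])
        beta, *_ = np.linalg.lstsq(A_train, y[~test], rcond=None)
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        pred[test] = A_test @ beta
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


def random_folds(n, k, rng):
    """Conventional CV: sites are assigned to k folds at random."""
    return rng.permutation(n) % k


def spatial_folds(coords, k):
    """Clustered CV (assumed scheme): k contiguous bands along the first coordinate."""
    n = len(coords)
    order = np.argsort(coords[:, 0])
    folds = np.empty(n, dtype=int)
    folds[order] = np.arange(n) * k // n
    return folds


# Synthetic "monitoring sites": locations, covariates, and a response
# that mixes covariate effects with a smooth spatial trend and noise.
rng = np.random.default_rng(0)
n_sites = 500
coords = rng.uniform(0.0, 1.0, size=(n_sites, 2))
X = rng.normal(size=(n_sites, 5))
y = (X[:, 0] + 0.5 * X[:, 1]
     + np.sin(4.0 * coords[:, 0])
     + rng.normal(scale=0.3, size=n_sites))

r2_random = cv_r2(X, y, random_folds(n_sites, 10, rng))
r2_spatial = cv_r2(X, y, spatial_folds(coords, 10))
print(f"conventional CV R^2: {r2_random:.2f}")
print(f"clustered CV R^2:    {r2_spatial:.2f}")
```

In clustered CV every held-out site lies in a region with no training sites, so the score reflects how well a model extrapolates to unmonitored areas rather than how well it interpolates between nearby monitors.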
format Online
Article
Text
id pubmed-7028280
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7028280 2020-02-27 PLoS One Research Article Public Library of Science 2020-02-18 /pmc/articles/PMC7028280/ /pubmed/32069301 http://dx.doi.org/10.1371/journal.pone.0228535 Text en © 2020 Kim et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of prediction model parsimony in integrated empirical geographic regression
topic Research Article