
Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the pr...


Bibliographic Details
Main authors: Hengl, Tomislav, Nussbaum, Madlene, Wright, Marvin N., Heuvelink, Gerard B.M., Gräler, Benedikt
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6119462/
https://www.ncbi.nlm.nih.gov/pubmed/30186691
http://dx.doi.org/10.7717/peerj.5518
_version_ 1783352090726236160
author Hengl, Tomislav
Nussbaum, Madlene
Wright, Marvin N.
Heuvelink, Gerard B.M.
Gräler, Benedikt
author_sort Hengl, Tomislav
collection PubMed
description Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but the spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still present in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatio-temporal variables. Performance of the RFsp framework is compared with state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain predictions as accurate and unbiased as those of different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as “knowledge engines” in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with an increase in calibration data and covariates and the high sensitivity of predictions to input data quality.
The key to the success of the RFsp framework might be the training data quality, especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and the quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with a lower number of points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
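The central idea in the description, using buffer distances from observation points as explanatory variables, can be illustrated with a minimal sketch. This is not the authors' implementation, which this record does not show: the function name `buffer_distance_features`, the toy coordinates, and the choice of plain Euclidean distance are all assumptions made for illustration. Each location receives one covariate per observation point, namely its distance to that point, and the resulting rows could then be passed to any random forest implementation alongside other covariates.

```python
import math


def buffer_distance_features(obs_xy, target_xy):
    """Build distance covariates (hypothetical helper, not from the paper).

    Returns one row per target location; column j holds the Euclidean
    distance from that location to observation point j.
    """
    return [
        [math.hypot(tx - ox, ty - oy) for (ox, oy) in obs_xy]
        for (tx, ty) in target_xy
    ]


# Hypothetical toy data: three observation points and two prediction locations.
obs_xy = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
grid_xy = [(0.0, 0.0), (3.0, 0.0)]

# Covariates for model fitting: distances between the observation points themselves.
train_features = buffer_distance_features(obs_xy, obs_xy)

# Covariates for prediction: distances from each new location to every observation.
grid_features = buffer_distance_features(obs_xy, grid_xy)
```

With n observation points this adds n covariate columns, so the feature matrix grows with the size of the calibration data, which is consistent with the computational cost noted in the description.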
format Online
Article
Text
id pubmed-6119462
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-6119462 2018-09-05 Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables Hengl, Tomislav Nussbaum, Madlene Wright, Marvin N. Heuvelink, Gerard B.M. Gräler, Benedikt PeerJ Biogeography PeerJ Inc. 2018-08-29 /pmc/articles/PMC6119462/ /pubmed/30186691 http://dx.doi.org/10.7717/peerj.5518 Text en ©2018 Hengl et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables
topic Biogeography
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6119462/
https://www.ncbi.nlm.nih.gov/pubmed/30186691
http://dx.doi.org/10.7717/peerj.5518