
Study becomes insight: Ecological learning from machine learning

1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforwar...


Bibliographic Details
Main Authors:	Yu, Qiuyan, Ji, Wenjie, Prihodko, Lara, Ross, C. Wade, Anchang, Julius Y., Hanan, Niall P.
Format:	Online Article Text
Language:	English
Published:	John Wiley and Sons Inc. 2021
Subjects:	Research Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9292299/ https://www.ncbi.nlm.nih.gov/pubmed/35874972 http://dx.doi.org/10.1111/2041-210X.13686

author	Yu, Qiuyan Ji, Wenjie Prihodko, Lara Ross, C. Wade Anchang, Julius Y. Hanan, Niall P.
collection	PubMed
description	1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the ‘learning’ hidden in the ML models. 2. We revisit the theoretical background and effectiveness of four approaches for deriving insights from ML: ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI), and two approaches for inference of bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multi‐variate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables. 3. We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non‐influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately impacted by sample size. Removing spurious variables improves interpretation of ML models. Meanwhile, increasing sample size has limited value in the presence of spurious variables, but does improve performance once spurious variables are omitted. 
Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors and the response variable can be enhanced using surrogate models, including three‐dimensional visualizations and use of loess planes to represent independent variable effects and interactions. 4. Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to ‘learn from machine learning’.
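The simulation design summarized in the description (a known functional relationship, added white noise, and a correlated but non-influential predictor) can be illustrated with a small permutation-importance experiment. The following is only a minimal sketch under stated assumptions: the quadratic response, the noise levels, the k-nearest-neighbour regressor used as a stand-in ML model, and all function names are hypothetical choices for illustration and are not taken from the article.

```python
import random

def simulate(n, seed=0):
    """Hypothetical simulation: y depends only on x1; x2 is spurious."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        x2 = x1 + rng.gauss(0.0, 0.1)   # correlated with x1, no effect on y
        x3 = rng.uniform(0.0, 1.0)      # independent, no effect on y
        # Known functional relationship plus white noise (assumed form).
        y.append(4.0 * (x1 - 0.5) ** 2 + rng.gauss(0.0, 0.05))
        X.append([x1, x2, x3])
    return X, y

def knn_predict(train_X, train_y, query, k=5):
    """Stand-in ML model: mean response of the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), target)
        for row, target in zip(train_X, train_y)
    )
    return sum(target for _, target in dists[:k]) / k

def mse(train_X, train_y, test_X, test_y):
    """Mean squared error of the k-NN model on a test set."""
    errs = [(knn_predict(train_X, train_y, q) - t) ** 2
            for q, t in zip(test_X, test_y)]
    return sum(errs) / len(errs)

def permutation_importance(train_X, train_y, test_X, test_y,
                           n_repeats=5, seed=1):
    """Mean increase in test MSE after shuffling each predictor column."""
    rng = random.Random(seed)
    base = mse(train_X, train_y, test_X, test_y)
    importances = []
    for j in range(len(test_X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in test_X]
            rng.shuffle(col)  # break the link between column j and y
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(test_X, col)]
            increases.append(mse(train_X, train_y, permuted, test_y) - base)
        importances.append(sum(increases) / n_repeats)
    return importances

if __name__ == "__main__":
    train_X, train_y = simulate(200, seed=0)
    test_X, test_y = simulate(100, seed=42)
    for name, imp in zip(["x1", "x2 (spurious)", "x3 (noise)"],
                         permutation_importance(train_X, train_y,
                                                test_X, test_y)):
        print(f"{name}: {imp:.4f}")
```

Comparing the score of the spurious predictor `x2` with that of the pure-noise predictor `x3`, with and without `x2` in the model, is one way to probe the interference from correlated, non-causal variables that the description reports.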
format	Online Article Text
id	pubmed-9292299
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	John Wiley and Sons Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-9292299 2022-07-20 Study becomes insight: Ecological learning from machine learning Yu, Qiuyan Ji, Wenjie Prihodko, Lara Ross, C. Wade Anchang, Julius Y. Hanan, Niall P. Methods Ecol Evol Research Articles John Wiley and Sons Inc. 2021-08-06 2021-11 /pmc/articles/PMC9292299/ /pubmed/35874972 http://dx.doi.org/10.1111/2041-210X.13686 Text en © 2021 The Authors. Methods in Ecology and Evolution published by John Wiley & Sons Ltd on behalf of British Ecological Society. This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
spellingShingle	Research Articles Yu, Qiuyan Ji, Wenjie Prihodko, Lara Ross, C. Wade Anchang, Julius Y. Hanan, Niall P. Study becomes insight: Ecological learning from machine learning
title	Study becomes insight: Ecological learning from machine learning
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9292299/ https://www.ncbi.nlm.nih.gov/pubmed/35874972 http://dx.doi.org/10.1111/2041-210X.13686
