
Using machine learning for crop yield prediction in the past or the future

The use of ML in agronomy has been increasing exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance o...


Bibliographic Details
Main authors: Morales, Alejandro; Villalobos, Francisco J.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10097960/
https://www.ncbi.nlm.nih.gov/pubmed/37063228
http://dx.doi.org/10.3389/fpls.2023.1128388
_version_ 1785024685566590976
author Morales, Alejandro
Villalobos, Francisco J.
author_facet Morales, Alejandro
Villalobos, Francisco J.
author_sort Morales, Alejandro
collection PubMed
description The use of ML in agronomy has been increasing exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance of the models, especially when they are built for yield forecasting. In this study, we explore the effect of the choice of predictive algorithm, amount of data, and data partitioning strategies on predictive performance, using synthetic datasets from biophysical crop models. We simulated sunflower and wheat data using OilcropSun and Ceres-Wheat from DSSAT for the period 2001-2020 in 5 areas of Spain. Simulations were performed for farms differing in soil depth and management. The data set of simulated farm yields was analyzed with different algorithms (regularized linear models, random forest, artificial neural networks) as a function of seasonal weather, management, and soil. The analysis was performed with Keras for neural networks and R packages for all other algorithms. Data partitioning for training and testing was performed with ordered data (i.e., older data for training, newest data for testing) in order to compare the different algorithms in their ability to predict yields in the future by extrapolating from past data. The Random Forest algorithm performed better (Root Mean Square Error, RMSE, 35-38%) than artificial neural networks (37-141%) and regularized linear models (64-65%) and was easier to execute. However, even the best models showed a limited advantage over the predictions of a sensible baseline (average yield of the farm in the training set), which had an RMSE of 42%. Errors in seasonal weather forecasting were not taken into account, so real-world performance is expected to be even closer to the baseline. Application of AI algorithms for yield prediction should always include a comparison with the best guess to evaluate whether the additional cost of data required for the model compensates for the increase in predictive power. Random partitioning of data for training and validation should be avoided in models for yield forecasting. Crop models validated for the region and cultivars of interest may be used before actual data collection to establish the potential advantage as illustrated in this study.
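The partitioning and baseline comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code (they used Keras and R packages). The toy data generator, the function names, the 2016 cut-off year, and the choice to express RMSE as a percentage of the mean observed yield are all assumptions made for illustration; the abstract does not state how the percentage RMSE was normalized.

```python
import math
import random
from collections import defaultdict


def make_synthetic_records(seed=0):
    """Hypothetical toy data: one yield per farm and year, 2001-2020."""
    rng = random.Random(seed)
    records = []
    for farm in range(5):
        farm_mean = 2000 + 300 * farm  # assumed farm-specific mean yield
        for year in range(2001, 2021):
            records.append(
                {"farm": farm, "year": year, "yield": farm_mean + rng.gauss(0, 400)}
            )
    return records


def temporal_split(records, first_test_year):
    """Ordered split: older seasons for training, newest seasons for testing."""
    train = [r for r in records if r["year"] < first_test_year]
    test = [r for r in records if r["year"] >= first_test_year]
    return train, test


def farm_mean_baseline(train):
    """Baseline predictor: average yield of each farm in the training set."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in train:
        sums[r["farm"]] += r["yield"]
        counts[r["farm"]] += 1
    return {farm: sums[farm] / counts[farm] for farm in sums}


def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed normalization)."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return 100 * math.sqrt(mse) / (sum(observed) / len(observed))


if __name__ == "__main__":
    train, test = temporal_split(make_synthetic_records(), first_test_year=2016)
    baseline = farm_mean_baseline(train)
    observed = [r["yield"] for r in test]
    predicted = [baseline[r["farm"]] for r in test]
    print(f"Baseline relative RMSE: {relative_rmse(observed, predicted):.1f}%")
```

Any candidate model would be scored with the same `relative_rmse` on the same test seasons, so its error can be compared directly with the baseline. A random split would instead mix future seasons into the training set, which is what the abstract advises against for forecasting.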
format Online
Article
Text
id pubmed-10097960
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10097960 2023-04-14 Using machine learning for crop yield prediction in the past or the future Morales, Alejandro Villalobos, Francisco J. Front Plant Sci Plant Science The use of ML in agronomy has been increasing exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance of the models, especially when they are built for yield forecasting. In this study, we explore the effect of the choice of predictive algorithm, amount of data, and data partitioning strategies on predictive performance, using synthetic datasets from biophysical crop models. We simulated sunflower and wheat data using OilcropSun and Ceres-Wheat from DSSAT for the period 2001-2020 in 5 areas of Spain. Simulations were performed for farms differing in soil depth and management. The data set of simulated farm yields was analyzed with different algorithms (regularized linear models, random forest, artificial neural networks) as a function of seasonal weather, management, and soil. The analysis was performed with Keras for neural networks and R packages for all other algorithms. Data partitioning for training and testing was performed with ordered data (i.e., older data for training, newest data for testing) in order to compare the different algorithms in their ability to predict yields in the future by extrapolating from past data. The Random Forest algorithm performed better (Root Mean Square Error, RMSE, 35-38%) than artificial neural networks (37-141%) and regularized linear models (64-65%) and was easier to execute. However, even the best models showed a limited advantage over the predictions of a sensible baseline (average yield of the farm in the training set), which had an RMSE of 42%. Errors in seasonal weather forecasting were not taken into account, so real-world performance is expected to be even closer to the baseline. Application of AI algorithms for yield prediction should always include a comparison with the best guess to evaluate whether the additional cost of data required for the model compensates for the increase in predictive power. Random partitioning of data for training and validation should be avoided in models for yield forecasting. Crop models validated for the region and cultivars of interest may be used before actual data collection to establish the potential advantage as illustrated in this study. Frontiers Media S.A. 2023-03-30 /pmc/articles/PMC10097960/ /pubmed/37063228 http://dx.doi.org/10.3389/fpls.2023.1128388 Text en Copyright © 2023 Morales and Villalobos https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Plant Science
Morales, Alejandro
Villalobos, Francisco J.
Using machine learning for crop yield prediction in the past or the future
title Using machine learning for crop yield prediction in the past or the future
title_full Using machine learning for crop yield prediction in the past or the future
title_fullStr Using machine learning for crop yield prediction in the past or the future
title_full_unstemmed Using machine learning for crop yield prediction in the past or the future
title_short Using machine learning for crop yield prediction in the past or the future
title_sort using machine learning for crop yield prediction in the past or the future
topic Plant Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10097960/
https://www.ncbi.nlm.nih.gov/pubmed/37063228
http://dx.doi.org/10.3389/fpls.2023.1128388
work_keys_str_mv AT moralesalejandro usingmachinelearningforcropyieldpredictioninthepastorthefuture
AT villalobosfranciscoj usingmachinelearningforcropyieldpredictioninthepastorthefuture