The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation

Regression analysis makes up a large part of supervised machine learning, and consists of the prediction of a continuous independent target from a set of other predictor variables. The difference between binary classification and regression is in the target range: in binary classification, the targe...

Bibliographic Details
Main authors: Chicco, Davide, Warrens, Matthijs J., Jurman, Giuseppe
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Data Mining and Machine Learning
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8279135/
https://www.ncbi.nlm.nih.gov/pubmed/34307865
http://dx.doi.org/10.7717/peerj-cs.623
collection PubMed
description Regression analysis makes up a large part of supervised machine learning, and consists of the prediction of a continuous independent target from a set of other predictor variables. The difference between binary classification and regression is in the target range: in binary classification, the target can have only two values (usually encoded as 0 and 1), while in regression the target can have multiple values. Even if regression analysis has been employed in a huge number of machine learning studies, no consensus has been reached on a single, unified, standard metric to assess the results of the regression itself. Many studies employ the mean square error (MSE) and its rooted variant (RMSE), or the mean absolute error (MAE) and its percentage variant (MAPE). Although useful, these rates share a common drawback: since their values can range between zero and +infinity, a single value of them does not say much about the performance of the regression with respect to the distribution of the ground truth elements. In this study, we focus on two rates that actually generate a high score only if the majority of the elements of a ground truth group has been correctly predicted: the coefficient of determination (also known as R-squared or R²) and the symmetric mean absolute percentage error (SMAPE). After showing their mathematical properties, we report a comparison between R² and SMAPE in several use cases and in two real medical scenarios. Our results demonstrate that the coefficient of determination (R-squared) is more informative and truthful than SMAPE, and does not have the interpretability limitations of MSE, RMSE, MAE and MAPE. We therefore suggest the usage of R-squared as standard metric to evaluate regression analyses in any scientific domain.
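The drawback described in the abstract can be illustrated with a minimal Python sketch. The abstract names the six rates but gives no formulas, so the textbook definitions below are assumptions, in particular the SMAPE variant bounded between 0 and 200 percent. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the article.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE, SMAPE and R-squared as a dict.

    Assumed textbook definitions; MAPE and SMAPE are in percent.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n

    # MAPE is undefined when a true value is zero; this sketch assumes none is.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    # SMAPE (assumed variant): absolute error divided by the mean of |true| and |pred|.
    # Undefined when a true value and its prediction are both zero.
    smape = 100.0 * sum(
        abs(e) / ((abs(t) + abs(p)) / 2.0)
        for e, t, p in zip(errors, y_true, y_pred)
    ) / n

    # R-squared: 1 minus residual over total sum of squares.
    # Undefined when the true values are all identical (ss_tot == 0).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "MAPE": mape,
        "SMAPE": smape,
        "R2": r2,
    }


# Hypothetical data: the same relative fit expressed in two different units.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1]
small = regression_metrics(y_true, y_pred)
large = regression_metrics([1000 * v for v in y_true], [1000 * v for v in y_pred])

for name in ("MSE", "RMSE", "MAE", "MAPE", "SMAPE", "R2"):
    print(f"{name:5s} original units: {small[name]:.4f}   x1000 units: {large[name]:.4f}")
```

Rescaling the target by 1000 multiplies MAE and RMSE by 1000 and MSE by a million, while R-squared and the two percentage rates stay the same. This is one way to see why a single MSE, RMSE or MAE value says little without knowing the distribution of the ground truth.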
id pubmed-8279135
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-8279135 2021-07-22 The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. Chicco, Davide; Warrens, Matthijs J.; Jurman, Giuseppe. PeerJ Comput Sci, Data Mining and Machine Learning. PeerJ Inc. 2021-07-05 /pmc/articles/PMC8279135/ /pubmed/34307865 http://dx.doi.org/10.7717/peerj-cs.623 Text en
© 2021 Chicco et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
topic Data Mining and Machine Learning