Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model

This study conducted a comprehensive analysis of multiple supervised machine learning models, regressors and classifiers, to accurately predict diamond prices. Diamond pricing is a complex task due to the non-linear relationships between key features such as carat, cut, clarity, table, and depth. Th...

Bibliographic Details
Main authors: Kigo, Samuel Njoroge, Omondi, Evans Otieno, Omolo, Bernard Oguna
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570374/
https://www.ncbi.nlm.nih.gov/pubmed/37828360
http://dx.doi.org/10.1038/s41598-023-44326-w
_version_ 1785119752691122176
author Kigo, Samuel Njoroge
Omondi, Evans Otieno
Omolo, Bernard Oguna
collection PubMed
description This study compared multiple supervised machine learning models, both regressors and classifiers, for predicting diamond prices. Diamond pricing is difficult to model because of the non-linear relationships between key features such as carat, cut, clarity, table, and depth. The analysis aimed to develop an accurate predictive model using both regression and classification approaches.
Preprocessing addressed outliers, missing values, feature scaling, and multicollinearity. Outliers were handled using the inter-quartile range method, missing values in numerical features were imputed using the median, and numerical predictors were standardized. Equal-width binning of the cut variable was performed to handle class imbalance. Correlation-based feature selection was used to eliminate highly correlated variables, so that only relevant features were included in the models.
Among the models evaluated, the random forest (RF) regressor performed best. It achieved the lowest root mean squared error (RMSE), 523.50, and an R-squared ($R^2$) score of 0.985, meaning it explained about 98.5% of the variance in diamond prices. Furthermore, the area under the curve of the RF classifier on the test set was 1.00 [Formula: see text], indicating perfect classification performance. These results make RF the best-performing model in both regression and classification.
The MLP regressor came second, with an RMSE of 563.74 and an $R^2$ score of 0.980, showing that it can capture the complex relationships in the data. Because its errors were only slightly higher than those of the RF regressor, further analysis is needed to determine its suitability and potential advantages relative to RF. The XGBoost regressor achieved an RMSE of 612.88 and an $R^2$ score of 0.972, and the boosted decision tree regressor an RMSE of 711.31 and an $R^2$ score of 0.968; both predicted prices effectively but with higher errors than the RF regressor.
The remaining models performed markedly worse. The KNN regressor yielded an RMSE of 1346.65 and an $R^2$ score of 0.887, and the linear regression model performed comparably, with an RMSE of 1395.41 and an $R^2$ score of 0.876. The support vector regression model showed the highest RMSE, 3044.49, and the lowest $R^2$ score, 0.421, indicating limited effectiveness in capturing the complex relationships in the data.
Overall, the study demonstrates that RF outperforms the other models in accuracy and predictive power, as evidenced by its lowest RMSE, highest $R^2$ score, and perfect classification performance. The study provides an effective tool for the diamond industry and emphasizes the importance of considering both regression and classification approaches when developing predictive models. The findings contribute insights for pricing strategies, market trends, and decision-making processes in the diamond industry and related fields.
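The preprocessing steps and evaluation metrics named in the description can be illustrated with a minimal sketch using only the Python standard library. This is not the authors' code: the function names, the 1.5 × IQR fence multiplier, the "inclusive" quartile method, and the use of the population standard deviation are all assumptions made for illustration, since the record does not state these details.

```python
import math
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 and the 'inclusive' quartile method are assumed conventions.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(statistics.fmean(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical toy values, not data from the study.
    carat = impute_median([0.3, None, 0.5, 0.7, 4.0])
    low, high = iqr_bounds(carat)
    kept = [c for c in carat if low <= c <= high]
    print(standardize(kept))
    print(rmse([100.0, 200.0], [110.0, 190.0]),
          r_squared([100.0, 200.0], [110.0, 190.0]))
```

RMSE is in the units of the target (here, price), so lower is better, while an $R^2$ of 1.0 means the predictions reproduce the targets exactly and 0.0 means they do no better than always predicting the mean.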
format Online
Article
Text
id pubmed-10570374
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10570374 2023-10-14 Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model Kigo, Samuel Njoroge Omondi, Evans Otieno Omolo, Bernard Oguna Sci Rep Article Nature Publishing Group UK 2023-10-12 /pmc/articles/PMC10570374/ /pubmed/37828360 http://dx.doi.org/10.1038/s41598-023-44326-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570374/
https://www.ncbi.nlm.nih.gov/pubmed/37828360
http://dx.doi.org/10.1038/s41598-023-44326-w