CatBoost for big data: an interdisciplinary review

Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a me...

Bibliographic Details
Main authors: Hancock, John T., Khoshgoftaar, Taghi M.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610170/
https://www.ncbi.nlm.nih.gov/pubmed/33169094
http://dx.doi.org/10.1186/s40537-020-00369-8
_version_ 1783605148298248192
author Hancock, John T.
Khoshgoftaar, Taghi M.
author_facet Hancock, John T.
Khoshgoftaar, Taghi M.
author_sort Hancock, John T.
collection PubMed
description Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
format Online
Article
Text
id pubmed-7610170
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-7610170 2020-11-05 CatBoost for big data: an interdisciplinary review Hancock, John T. Khoshgoftaar, Taghi M. J Big Data Survey Paper
Springer International Publishing 2020-11-04 2020 /pmc/articles/PMC7610170/ /pubmed/33169094 http://dx.doi.org/10.1186/s40537-020-00369-8 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Survey Paper
Hancock, John T.
Khoshgoftaar, Taghi M.
CatBoost for big data: an interdisciplinary review
title CatBoost for big data: an interdisciplinary review
topic Survey Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7610170/
https://www.ncbi.nlm.nih.gov/pubmed/33169094
http://dx.doi.org/10.1186/s40537-020-00369-8
work_keys_str_mv AT hancockjohnt catboostforbigdataaninterdisciplinaryreview
AT khoshgoftaartaghim catboostforbigdataaninterdisciplinaryreview