Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams...

Bibliographic Details
Main authors:	AlQabbany, Abdulaziz O., Azmi, Aqil M.
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8305386/ https://www.ncbi.nlm.nih.gov/pubmed/34356400 http://dx.doi.org/10.3390/e23070859

author	AlQabbany, Abdulaziz O. Azmi, Aqil M.
collection	PubMed
description	We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated by a Poisson [Formula: see text] distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness ([Formula: see text]), fuses the two most essential aspects in online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter [Formula: see text] of the Poisson distribution that yields the best value for [Formula: see text]. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.
format	Online Article Text
id	pubmed-8305386
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8305386 2021-07-25 Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams AlQabbany, Abdulaziz O. Azmi, Aqil M. Entropy (Basel) Article MDPI 2021-07-04 /pmc/articles/PMC8305386/ /pubmed/34356400 http://dx.doi.org/10.3390/e23070859 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams
topic	Article
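The description above rests on two ideas that a short sketch may make more concrete: a binomial resampling count can be approximated by a Poisson distribution, and the Poisson parameter controls how many training updates each incoming instance triggers. The following is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`max_pmf_gap`, `count_updates`), the ensemble size, and the parameter values are hypothetical and chosen only for illustration. The record does not give the formula for the resampling effectiveness measure, so it is not computed here.

```python
import math
import random


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n, lam=1.0, k_max=10):
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) for k = 0..k_max."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(k_max + 1)
    )


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def count_updates(n_instances, n_learners, lam, seed=0):
    """Count training updates when every learner sees each instance Poisson(lam) times.

    This mimics the resampling step of online bagging only; no trees are trained.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_instances):
        for _ in range(n_learners):
            total += poisson_sample(lam, rng)
    return total


if __name__ == "__main__":
    # The approximation tightens as the number of trials grows.
    for n in (10, 100, 1000):
        print(f"n={n:5d}  max |Binomial - Poisson| = {max_pmf_gap(n):.6f}")

    # Training work grows roughly in proportion to the Poisson parameter.
    for lam in (1.0, 3.0, 6.0):
        print(f"lambda={lam}  updates={count_updates(2000, 10, lam)}")
```

In this sketch the expected number of updates is `n_instances * n_learners * lam`, which is why the choice of the Poisson parameter affects execution time as well as how heavily each instance is weighted during training.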

Similar Items