
Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem

Artificial intelligence, and machine learning in particular, has been applied by the research community in a variety of ways to transform diverse data sources into valuable facts and understanding, enabling superior pattern identification. Machine learning algorithms on huge and complicated dat...

Bibliographic Details
Main authors: Junaid, Muhammad; Ali, Sajid; Siddiqui, Isma Farah; Nam, Choonsung; Qureshi, Nawab Muhammad Faseeh; Kim, Jaehyoun; Shin, Dong Ryeol
Format: Online Article Text
Language: English
Published: Springer US 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9396610/
https://www.ncbi.nlm.nih.gov/pubmed/36033548
http://dx.doi.org/10.1007/s11277-021-09362-7
author Junaid, Muhammad
Ali, Sajid
Siddiqui, Isma Farah
Nam, Choonsung
Qureshi, Nawab Muhammad Faseeh
Kim, Jaehyoun
Shin, Dong Ryeol
collection PubMed
description Artificial intelligence, and machine learning in particular, has been applied by the research community in a variety of ways to transform diverse data sources into valuable facts and understanding, enabling superior pattern identification. Machine learning algorithms on huge and complicated data sets are, however, computationally expensive: processing requires hardware and logical resources such as space, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a complex big data infrastructure becomes more and more relevant. The Apache Spark machine learning library (ML-lib) is a well-known platform for big data analysis; it includes several useful features for machine learning applications, including regression, classification, and dimension reduction, as well as clustering and feature extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform-independent. We have evaluated and compared several ML algorithms to analyze the platform's qualities, and compared Apache Spark ML-lib against Rapid Miner and Sklearn, two additional big data and machine learning processing platforms. Four machine learning algorithms are compared across platforms: Logistic Classifier (LC), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC). In addition, we have tested general regression methods such as Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR) on the SUSY and HIGGS datasets. Moreover, we have evaluated unsupervised learning methods such as K-means and Gaussian Mixture Models on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY," "HIGGS," "BANK," and "HEPMASS" datasets from the UCI data repository. We also discuss recent developments in research into big data machines and provide future research directions.
format Online
Article
Text
id pubmed-9396610
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer US
record_format MEDLINE/PubMed
spelling pubmed-9396610 2022-08-23 Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem Junaid, Muhammad; Ali, Sajid; Siddiqui, Isma Farah; Nam, Choonsung; Qureshi, Nawab Muhammad Faseeh; Kim, Jaehyoun; Shin, Dong Ryeol Wirel Pers Commun Article Springer US 2022-08-23 2022 /pmc/articles/PMC9396610/ /pubmed/36033548 http://dx.doi.org/10.1007/s11277-021-09362-7 Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, corrected publication. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9396610/
https://www.ncbi.nlm.nih.gov/pubmed/36033548
http://dx.doi.org/10.1007/s11277-021-09362-7