
A distributed computing model for big data anonymization in the networks


Bibliographic Details
Main authors:	Ashkouti, Farough; Khamforoosh, Keyhan
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10146481/ https://www.ncbi.nlm.nih.gov/pubmed/37115783 http://dx.doi.org/10.1371/journal.pone.0285212

author	Ashkouti, Farough Khamforoosh, Keyhan
collection	PubMed
description	Big data and its applications have recently grown sharply in fields such as IoT, bioinformatics, e-commerce, and social media. The huge volume of data poses enormous challenges to the architecture, infrastructure, and computing capacity of IT systems, so the scientific and industrial communities have a compelling need for large-scale, robust computing systems. Since value is one of the characteristics of big data, data should be published so that analysts can extract useful patterns from it. However, data publishing may lead to the disclosure of individuals’ private information. Among modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that provides high scalability by introducing resilient distributed datasets (RDDs). Owing to its in-memory computations, it is reported to be up to 100 times faster than Hadoop. Apache Spark is therefore one of the essential frameworks for implementing distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming model of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. The computing model has three phases of in-memory computation to address the runtime, scalability, and performance of large-scale data anonymization. It supports partition-based data clustering algorithms that preserve the λ-diversity privacy model by using transformations and actions on RDDs. Accordingly, the authors investigate a Spark-based implementation that preserves the λ-diversity privacy model using two purpose-designed distance functions, City block and Pearson. The results provide a comprehensive guideline that allows researchers to apply Apache Spark in their own research.
format	Online Article Text
id	pubmed-10146481
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10146481 2023-04-29 A distributed computing model for big data anonymization in the networks Ashkouti, Farough Khamforoosh, Keyhan PLoS One Research Article Public Library of Science 2023-04-28 /pmc/articles/PMC10146481/ /pubmed/37115783 http://dx.doi.org/10.1371/journal.pone.0285212 Text en © 2023 Ashkouti, Khamforoosh https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	A distributed computing model for big data anonymization in the networks
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10146481/ https://www.ncbi.nlm.nih.gov/pubmed/37115783 http://dx.doi.org/10.1371/journal.pone.0285212
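
The description above names two distance functions, City block and Pearson, and a λ-diversity requirement on the resulting clusters, but this record does not give their exact definitions. The following is a minimal sketch under stated assumptions, not the authors' implementation: City block is taken as the standard L1 distance, Pearson distance is assumed to be one minus the Pearson correlation coefficient, and the diversity check assumes the simple "distinct values" reading of the privacy model. All function names are hypothetical.

```python
import math
from typing import Hashable, Sequence


def city_block_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """City block (Manhattan, L1) distance: sum of absolute differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form: 1 - Pearson correlation (0 = perfectly correlated)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / (norm_a * norm_b)


def is_distinct_diverse(sensitive_values: Sequence[Hashable], lam: int) -> bool:
    """Assumed check: a cluster passes if it holds at least `lam`
    distinct sensitive values."""
    return len(set(sensitive_values)) >= lam
```

In a Spark job, functions like these would typically be called inside RDD transformations such as `map`, with an action collecting or counting the result; how the paper arranges its three phases is not stated in this record.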
