
Efficient processing of complex XSD using Hive and Spark

The eXtensible Markup Language (XML) files are widely used by the industry due to their flexibility in representing numerous kinds of data. Multiple applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension base...


Bibliographic Details
Main authors:	Martinez-Mosquera, Diana, Navarrete, Rosa, Luján-Mora, Sergio
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Algorithms and Analysis of Algorithms
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8384044/ https://www.ncbi.nlm.nih.gov/pubmed/34497870 http://dx.doi.org/10.7717/peerj-cs.652

_version_	1783741844912340992
author	Martinez-Mosquera, Diana Navarrete, Rosa Luján-Mora, Sergio
author_facet	Martinez-Mosquera, Diana Navarrete, Rosa Luján-Mora, Sergio
author_sort	Martinez-Mosquera, Diana
collection	PubMed
description	The eXtensible Markup Language (XML) files are widely used by the industry due to their flexibility in representing numerous kinds of data. Multiple applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements or large real-world files. A great number of these files are generated each day and this has influenced the development of Big Data tools for their parsing and reporting, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, a more usual approach in such works involves the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, the deserialization and positional explode are straightforwardly implemented. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment to illustrate the methods using real data sets provided from performance management of two mobile network vendors. Our main results state the validity of the proposed method for different versions of Apache Hive and Apache Spark, obtain the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance in Apache Hive with that of Apache Spark. Another contribution made is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.
format	Online Article Text
id	pubmed-8384044
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-8384044 2021-09-07 Efficient processing of complex XSD using Hive and Spark Martinez-Mosquera, Diana Navarrete, Rosa Luján-Mora, Sergio PeerJ Comput Sci Algorithms and Analysis of Algorithms The eXtensible Markup Language (XML) files are widely used by the industry due to their flexibility in representing numerous kinds of data. Multiple applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements or large real-world files. A great number of these files are generated each day and this has influenced the development of Big Data tools for their parsing and reporting, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, a more usual approach in such works involves the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, the deserialization and positional explode are straightforwardly implemented. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment to illustrate the methods using real data sets provided from performance management of two mobile network vendors. Our main results state the validity of the proposed method for different versions of Apache Hive and Apache Spark, obtain the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance in Apache Hive with that of Apache Spark. Another contribution made is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks. PeerJ Inc. 2021-08-17 /pmc/articles/PMC8384044/ /pubmed/34497870 http://dx.doi.org/10.7717/peerj-cs.652 Text en ©2021 Martinez-Mosquera et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Algorithms and Analysis of Algorithms Martinez-Mosquera, Diana Navarrete, Rosa Luján-Mora, Sergio Efficient processing of complex XSD using Hive and Spark
title	Efficient processing of complex XSD using Hive and Spark
title_full	Efficient processing of complex XSD using Hive and Spark
title_fullStr	Efficient processing of complex XSD using Hive and Spark
title_full_unstemmed	Efficient processing of complex XSD using Hive and Spark
title_short	Efficient processing of complex XSD using Hive and Spark
title_sort	efficient processing of complex xsd using hive and spark
topic	Algorithms and Analysis of Algorithms
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8384044/ https://www.ncbi.nlm.nih.gov/pubmed/34497870 http://dx.doi.org/10.7717/peerj-cs.652
work_keys_str_mv	AT martinezmosqueradiana efficientprocessingofcomplexxsdusinghiveandspark AT navarreterosa efficientprocessingofcomplexxsdusinghiveandspark AT lujanmorasergio efficientprocessingofcomplexxsdusinghiveandspark


Similar items