
Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes

In life sciences, accompanied by the rapid growth of sequencing technology and the advancement of research, vast amounts of data are being generated. It is known that as the size of Resource Description Framework (RDF) datasets increases, the more efficient loading to triple stores is crucial. For e...


Bibliographic Details
Main Authors:	Yamaguchi, Atsuko; Yamamoto, Yasunori
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6548388/ https://www.ncbi.nlm.nih.gov/pubmed/31163073 http://dx.doi.org/10.1371/journal.pone.0217852

_version_	1783423837747019776
author	Yamaguchi, Atsuko Yamamoto, Yasunori
author_facet	Yamaguchi, Atsuko Yamamoto, Yasunori
author_sort	Yamaguchi, Atsuko
collection	PubMed
description	In life sciences, accompanied by the rapid growth of sequencing technology and the advancement of research, vast amounts of data are being generated. It is known that as the size of Resource Description Framework (RDF) datasets increases, the more efficient loading to triple stores is crucial. For example, UniProt’s RDF version contains 44 billion triples as of December 2018. PubChem also has an RDF dataset with 137 billion triples. As data sizes become extremely large, loading them to a triple store consumes time. To improve the efficiency of this task, parallel loading has been recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, pairs of identical blank nodes in the original dataset are recognized as different if they reside in separate files after the dataset is split for parallel loading. To address this issue, we propose the Split4Blank tool, which splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool uses connected component and multiprocessor scheduling algorithms and satisfies the above condition. Furthermore, to confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as a scale-free and random graph.
format	Online Article Text
id	pubmed-6548388
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-65483882019-06-17 Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes Yamaguchi, Atsuko Yamamoto, Yasunori PLoS One Research Article In life sciences, accompanied by the rapid growth of sequencing technology and the advancement of research, vast amounts of data are being generated. It is known that as the size of Resource Description Framework (RDF) datasets increases, the more efficient loading to triple stores is crucial. For example, UniProt’s RDF version contains 44 billion triples as of December 2018. PubChem also has an RDF dataset with 137 billion triples. As data sizes become extremely large, loading them to a triple store consumes time. To improve the efficiency of this task, parallel loading has been recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, pairs of identical blank nodes in the original dataset are recognized as different if they reside in separate files after the dataset is split for parallel loading. To address this issue, we propose the Split4Blank tool, which splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool uses connected component and multiprocessor scheduling algorithms and satisfies the above condition. Furthermore, to confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as a scale-free and random graph. 
Public Library of Science 2019-06-04 /pmc/articles/PMC6548388/ /pubmed/31163073 http://dx.doi.org/10.1371/journal.pone.0217852 Text en © 2019 Yamaguchi, Yamamoto http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Yamaguchi, Atsuko Yamamoto, Yasunori Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title	Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title_full	Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title_fullStr	Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title_full_unstemmed	Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title_short	Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes
title_sort	split4blank: maintaining consistency while improving efficiency of loading rdf data with blank nodes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6548388/ https://www.ncbi.nlm.nih.gov/pubmed/31163073 http://dx.doi.org/10.1371/journal.pone.0217852
work_keys_str_mv	AT yamaguchiatsuko split4blankmaintainingconsistencywhileimprovingefficiencyofloadingrdfdatawithblanknodes AT yamamotoyasunori split4blankmaintainingconsistencywhileimprovingefficiencyofloadingrdfdatawithblanknodes
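
The description field in this record says the tool combines connected component and multiprocessor scheduling algorithms so that identical blank nodes are never separated across files. The following is a minimal sketch of that idea, not the Split4Blank implementation itself. It assumes N-Triples input with one triple per line, uses a deliberately simplified parser, groups triples with a union-find structure, and balances the groups with the longest-processing-time-first greedy heuristic, which is one common multiprocessor scheduling heuristic chosen here for illustration. The names blank_labels, find, and split_triples are hypothetical.

```python
import heapq
from collections import defaultdict


def blank_labels(line):
    # Simplified, assumed parser: subject is the first token, object is
    # whatever follows the predicate (minus the terminating " .").
    subj, _pred, rest = line.strip().split(None, 2)
    obj = rest.rstrip(" .")
    return [term for term in (subj, obj) if term.startswith("_:")]


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def split_triples(lines, k):
    """Distribute triples over k buckets without separating a blank node."""
    parent = {}
    parsed = []
    for line in lines:
        labels = blank_labels(line)
        for label in labels:
            parent.setdefault(label, label)
        if len(labels) == 2:
            # Subject and object are both blank: merge their components.
            parent[find(parent, labels[0])] = find(parent, labels[1])
        parsed.append((line, labels))

    # Collect triples per connected component; triples without blank
    # nodes form single-triple jobs that can go anywhere.
    components = defaultdict(list)
    singles = []
    for line, labels in parsed:
        if labels:
            components[find(parent, labels[0])].append(line)
        else:
            singles.append([line])

    # Longest-processing-time-first: place the largest job on the
    # currently lightest bucket (illustrative heuristic, an assumption).
    jobs = sorted(list(components.values()) + singles, key=len, reverse=True)
    heap = [(0, i) for i in range(k)]
    buckets = [[] for _ in range(k)]
    for job in jobs:
        load, i = heapq.heappop(heap)
        buckets[i].extend(job)
        heapq.heappush(heap, (load + len(job), i))
    return buckets
```

Each returned bucket could then be written to its own file and loaded in parallel. Because every triple mentioning a given blank node label lands in the same bucket, the label keeps a single identity within that file.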


Similar Items