Multi-source dataset of e-commerce products with attributes for property matching

Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous sources of data in order to integrate them, which has become increasingly relevant with the development of web technologies and open data initiatives. One of the involved tasks is the matchin...

Bibliographic Details
Main authors: Ayala, Daniel, Hernández, Inma, Ruiz, David, Rahm, Erhard
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8847803/
https://www.ncbi.nlm.nih.gov/pubmed/35198667
http://dx.doi.org/10.1016/j.dib.2022.107884
author Ayala, Daniel
Hernández, Inma
Ruiz, David
Rahm, Erhard
collection PubMed
description Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous data sources in order to integrate them, a task that has become increasingly relevant with the development of web technologies and open data initiatives. One of the tasks involved is the matching of data properties, which attempts to find correspondences between the attributes of entities. This is challenging because equivalent properties sometimes have different names. Furthermore, some properties may not be equivalent but still match in 1..n relationships. These difficulties create the need for varied evaluation datasets, for two reasons. First, they are needed to evaluate existing techniques in a variety of scenarios. Second, they enable the training of supervised techniques that may even become context-independent if trained with data from sufficiently diverse contexts. To support the evaluation and training of data property matching techniques, we present a collection of datasets consisting of product records from four different contexts. These datasets are the result of transforming two existing datasets. In one of them, some properties were filtered out for being too noisy. The resulting processed dataset consists of JSON files with a listing of the product records and their properties, and a separate grouping of the properties that determines which ones match. It contains information about 2860 entities, with 4386 properties and 13350 pairwise matches.
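The relation between a grouping of properties and a count of pairwise matches can be illustrated with a minimal sketch. The group layout, the source names (`shop_a`, `shop_b`, `shop_c`) and the property names below are hypothetical and are not taken from the dataset's actual JSON schema. The sketch also assumes that every unordered pair inside a group counts as one match, which may differ from how the authors computed their figure.

```python
from itertools import combinations

# Hypothetical grouping: each inner list holds (source, property) identifiers
# that are assumed to match one another. Illustrative only; this is not the
# dataset's real structure or content.
groups = [
    [("shop_a", "brand"), ("shop_b", "manufacturer"), ("shop_c", "maker")],
    [("shop_a", "weight"), ("shop_b", "item weight")],
    [("shop_c", "colour")],  # a property with no counterpart yields no pairs
]


def pairwise_matches(groups):
    """Expand each group into all unordered pairs of its members."""
    pairs = []
    for group in groups:
        # A group of k properties contributes k * (k - 1) / 2 pairs.
        pairs.extend(combinations(group, 2))
    return pairs


matches = pairwise_matches(groups)
print(len(matches))
```

Under this assumption a group of three properties contributes three pairs, a group of two contributes one, and a singleton contributes none.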
format Online
Article
Text
id pubmed-8847803
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8847803 2022-02-22 Multi-source dataset of e-commerce products with attributes for property matching Ayala, Daniel; Hernández, Inma; Ruiz, David; Rahm, Erhard. Data Brief. Data Article. Elsevier 2022-02-02 /pmc/articles/PMC8847803/ /pubmed/35198667 http://dx.doi.org/10.1016/j.dib.2022.107884 Text en © 2022 The Authors. Published by Elsevier Inc.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Multi-source dataset of e-commerce products with attributes for property matching
topic Data Article