Multi-source dataset of e-commerce products with attributes for property matching
Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous sources of data in order to integrate them, which has become increasingly relevant with the development of web technologies and open data initiatives. One of the involved tasks is the matchin...
Main authors: | Ayala, Daniel; Hernández, Inma; Ruiz, David; Rahm, Erhard |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | Data Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8847803/ https://www.ncbi.nlm.nih.gov/pubmed/35198667 http://dx.doi.org/10.1016/j.dib.2022.107884 |
_version_ | 1784652125172662272 |
---|---|
author | Ayala, Daniel; Hernández, Inma; Ruiz, David; Rahm, Erhard |
author_facet | Ayala, Daniel; Hernández, Inma; Ruiz, David; Rahm, Erhard |
author_sort | Ayala, Daniel |
collection | PubMed |
description | Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous sources of data in order to integrate them, which has become increasingly relevant with the development of web technologies and open data initiatives. One of the involved tasks is the matching of data properties, which attempts to find correspondences between the attributes of the entities. This is challenging because equivalent properties sometimes have different names. Furthermore, some properties may not be equivalent, but still match in 1..n relationships. These difficulties create the need for varied evaluation datasets for two reasons. First, they are needed to evaluate existing techniques in a variety of scenarios. Second, they enable the training of supervised techniques that may even become context-independent if trained with data from sufficiently diverse contexts. To support the evaluation and training of data property matching techniques, we present a collection of datasets consisting of product records from four different contexts. These datasets are the result of transforming two different existing datasets. In one of the datasets, some properties were filtered out for being too noisy. The resulting processed dataset consists of JSON files with a listing of the product records and their properties, and a separate grouping of the properties that determines which ones match. It contains information about 2860 entities, with 4386 properties and 13350 pairwise matches. |
format | Online Article Text |
id | pubmed-8847803 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-8847803 2022-02-22 Multi-source dataset of e-commerce products with attributes for property matching Ayala, Daniel; Hernández, Inma; Ruiz, David; Rahm, Erhard Data Brief Data Article Schema/ontology matching consists in finding matches between types, properties and entities in heterogeneous sources of data in order to integrate them, which has become increasingly relevant with the development of web technologies and open data initiatives. One of the involved tasks is the matching of data properties, which attempts to find correspondences between the attributes of the entities. This is challenging because equivalent properties sometimes have different names. Furthermore, some properties may not be equivalent, but still match in 1..n relationships. These difficulties create the need for varied evaluation datasets for two reasons. First, they are needed to evaluate existing techniques in a variety of scenarios. Second, they enable the training of supervised techniques that may even become context-independent if trained with data from sufficiently diverse contexts. To support the evaluation and training of data property matching techniques, we present a collection of datasets consisting of product records from four different contexts. These datasets are the result of transforming two different existing datasets. In one of the datasets, some properties were filtered out for being too noisy. The resulting processed dataset consists of JSON files with a listing of the product records and their properties, and a separate grouping of the properties that determines which ones match. It contains information about 2860 entities, with 4386 properties and 13350 pairwise matches. Elsevier 2022-02-02 /pmc/articles/PMC8847803/ /pubmed/35198667 http://dx.doi.org/10.1016/j.dib.2022.107884 Text en © 2022 The Authors. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Ayala, Daniel Hernández, Inma Ruiz, David Rahm, Erhard Multi-source dataset of e-commerce products with attributes for property matching |
title | Multi-source dataset of e-commerce products with attributes for property matching |
title_full | Multi-source dataset of e-commerce products with attributes for property matching |
title_fullStr | Multi-source dataset of e-commerce products with attributes for property matching |
title_full_unstemmed | Multi-source dataset of e-commerce products with attributes for property matching |
title_short | Multi-source dataset of e-commerce products with attributes for property matching |
title_sort | multi-source dataset of e-commerce products with attributes for property matching |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8847803/ https://www.ncbi.nlm.nih.gov/pubmed/35198667 http://dx.doi.org/10.1016/j.dib.2022.107884 |
work_keys_str_mv | AT ayaladaniel multisourcedatasetofecommerceproductswithattributesforpropertymatching AT hernandezinma multisourcedatasetofecommerceproductswithattributesforpropertymatching AT ruizdavid multisourcedatasetofecommerceproductswithattributesforpropertymatching AT rahmerhard multisourcedatasetofecommerceproductswithattributesforpropertymatching |
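
The description mentions a separate grouping of properties that determines which ones match, alongside a count of pairwise matches. As a rough illustration of how a grouping can relate to pairwise matches, the sketch below enumerates every unordered pair inside each group. The property identifiers, the list-of-lists layout, and the function name `pairwise_matches` are all hypothetical and are not taken from the dataset's JSON files; whether the dataset derives its reported match count in exactly this way is also an assumption, particularly given the 1..n relationships the description mentions.

```python
from itertools import combinations

# Hypothetical grouping: each inner list holds property identifiers that are
# assumed to match one another. Names and layout are illustrative only and do
# not reflect the actual structure of the dataset's JSON files.
groups = [
    ["shopA.price", "shopB.cost", "shopC.price_eur"],
    ["shopA.brand", "shopB.manufacturer"],
    ["shopC.weight"],
]


def pairwise_matches(groups):
    """Return every unordered pair of properties that share a group."""
    pairs = []
    for group in groups:
        # A group of k properties contributes k * (k - 1) / 2 pairs,
        # so a singleton group contributes none.
        pairs.extend(combinations(sorted(group), 2))
    return pairs


pairs = pairwise_matches(groups)
print(len(pairs))  # 3 pairs + 1 pair + 0 pairs = 4
```

Storing groups rather than explicit pairs keeps the ground truth compact, since the number of pairs grows quadratically with group size.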