Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity

This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity metho...

Bibliographic Details
Main authors: Lastra-Díaz, Juan J., Goikoetxea, Josu, Hadj Taieb, Mohamed Ali, García-Serrano, Ana, Aouicha, Mohamed Ben, Agirre, Eneko
Format: Online Article Text
Language: English
Published: Elsevier 2019
Subjects: Computer Science
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/
https://www.ncbi.nlm.nih.gov/pubmed/31516953
http://dx.doi.org/10.1016/j.dib.2019.104432
author Lastra-Díaz, Juan J.
Goikoetxea, Josu
Hadj Taieb, Mohamed Ali
García-Serrano, Ana
Aouicha, Mohamed Ben
Agirre, Eneko
collection PubMed
description This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables reported in our companion paper (Lastra-Díaz et al., 2019), which presents the largest experimental survey of ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. All our experiments, and the gathering of the raw data derived from them, are based on the implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and on their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. The raw data files were processed by an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform that allows new word similarity experiments to be set up and run, including methods or word similarity benchmarks not considered in our survey.
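The evaluation metrics named in the description can be illustrated with a minimal sketch. This is not the authors' R script: it is a standalone Python illustration, and it assumes that the harmonic score is the harmonic mean of the Pearson and Spearman correlations, h = 2·r·ρ / (r + ρ), which this record does not state. The function names and the similarity values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def harmonic_score(r, rho):
    """Assumed definition: harmonic mean of Pearson (r) and Spearman (rho)."""
    return 2 * r * rho / (r + rho)


# Hypothetical data: human judgements and one method's similarity values
# for the same five word pairs of a benchmark.
human = [3.9, 3.1, 1.2, 0.4, 2.5]
method = [0.92, 0.80, 0.35, 0.10, 0.55]

r = pearson(human, method)
rho = spearman(human, method)
h = harmonic_score(r, rho)
print(f"Pearson={r:.3f} Spearman={rho:.3f} harmonic={h:.3f}")
```

The assumed harmonic score is undefined when r + ρ = 0, so a real evaluation script would need to handle that case explicitly.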
format Online
Article
Text
id pubmed-6736772
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-6736772 2019-09-12 Data Brief Computer Science
Elsevier 2019-08-26 /pmc/articles/PMC6736772/ /pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
topic Computer Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/
https://www.ncbi.nlm.nih.gov/pubmed/31516953
http://dx.doi.org/10.1016/j.dib.2019.104432