Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity metho...
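Main Authors: | Lastra-Díaz, Juan J.; Goikoetxea, Josu; Hadj Taieb, Mohamed Ali; García-Serrano, Ana; Aouicha, Mohamed Ben; Agirre, Eneko |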
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2019 |
Subjects: | Computer Science |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/ https://www.ncbi.nlm.nih.gov/pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432 |
_version_ | 1783450552550555648 |
---|---|
author | Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko |
author_facet | Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko |
author_sort | Lastra-Díaz, Juan J. |
collection | PubMed |
description | This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. The raw data files were processed by running an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and that automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform on which new word similarity experiments can be set up and run, including methods and word similarity benchmarks not considered in our survey. |
format | Online Article Text |
id | pubmed-6736772 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-6736772 2019-09-12 Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko Data Brief Computer Science This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity reported in the literature. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. The raw data files were processed by running an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and that automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform on which new word similarity experiments can be set up and run, including methods and word similarity benchmarks not considered in our survey.
Elsevier 2019-08-26 /pmc/articles/PMC6736772/ /pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Computer Science Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title | Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title_full | Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title_fullStr | Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title_full_unstemmed | Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title_short | Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
title_sort | reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity |
topic | Computer Science |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/ https://www.ncbi.nlm.nih.gov/pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432 |
work_keys_str_mv | AT lastradiazjuanj reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT goikoetxeajosu reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT hadjtaiebmohamedali reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT garciaserranoana reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT aouichamohamedben reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT agirreeneko reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity |
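
The description above mentions evaluation metrics (Pearson and Spearman correlation, harmonic score) computed from the raw similarity values that each method returns for each word pair. The following is a minimal Python sketch of how such metrics could be computed for one method on one benchmark. It is not the authors' R script: the function names and sample values are hypothetical, and the harmonic score is assumed here to be the harmonic mean of the two correlations, which this record does not define.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_score(r, rho):
    """Harmonic mean of Pearson r and Spearman rho (assumed definition)."""
    return 2 * r * rho / (r + rho)


# Hypothetical values: human judgments and one method's raw similarities
# for the same five word pairs.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
method_scores = [1.0, 4.0, 9.0, 16.0, 25.0]

r = pearson(human_scores, method_scores)
rho = spearman(human_scores, method_scores)
h = harmonic_score(r, rho)
print(f"Pearson={r:.4f} Spearman={rho:.4f} harmonic={h:.4f}")
```

In this toy input the method's scores increase monotonically but not linearly with the human scores, so the rank-based Spearman value is higher than the Pearson value. Significance p-values, also mentioned in the description, are left out of this sketch.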