Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity

This data article introduces a reproducibility dataset with the aim of allowing the exact replication of all experiments, results and data tables introduced in our companion paper (Lastra-Díaz et al., 2019), which introduces the largest experimental survey on ontology-based semantic similarity metho...

Bibliographic Details
Main Authors:	Lastra-Díaz, Juan J.; Goikoetxea, Josu; Hadj Taieb, Mohamed Ali; García-Serrano, Ana; Aouicha, Mohamed Ben; Agirre, Eneko
Format:	Online Article Text
Language:	English
Published:	Elsevier 2019
Subjects:	Computer Science
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/ https://www.ncbi.nlm.nih.gov/pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432

_version_	1783450552550555648
author	Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko
author_facet	Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko
author_sort	Lastra-Díaz, Juan J.
collection	PubMed
description	This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables presented in our companion paper (Lastra-Díaz et al., 2019), which reports the largest experimental survey in the literature on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. The raw data files were processed by an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform in which new word similarity experiments can be set up and run, including methods or word similarity benchmarks not considered in our survey.
format	Online Article Text
id	pubmed-6736772
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-6736772 2019-09-12 Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko Data Brief Computer Science This data article introduces a reproducibility dataset that allows the exact replication of all experiments, results and data tables presented in our companion paper (Lastra-Díaz et al., 2019), which reports the largest experimental survey in the literature on ontology-based semantic similarity methods and Word Embeddings (WE) for word similarity. The implementation of all our experiments, as well as the gathering of all raw data derived from them, was based on the software implementation and evaluation of all methods in the HESML library (Lastra-Díaz et al., 2017), and their subsequent recording with Reprozip (Chirigati et al., 2016). The raw data consist of a collection of data files containing the raw word-similarity values returned by each method for each word pair evaluated in any benchmark. The raw data files were processed by an R-language script that computes all evaluation metrics reported in Lastra-Díaz et al. (2019), such as Pearson and Spearman correlation, harmonic score and statistical significance p-values, and automatically generates all data tables shown in our companion paper. Our dataset provides all input data files, resources and complementary software tools needed to reproduce from scratch all our experimental data, statistical analysis and reported data. Finally, our reproducibility dataset provides a self-contained experimentation platform in which new word similarity experiments can be set up and run, including methods or word similarity benchmarks not considered in our survey.
Elsevier 2019-08-26 /pmc/articles/PMC6736772/ /pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Computer Science Lastra-Díaz, Juan J. Goikoetxea, Josu Hadj Taieb, Mohamed Ali García-Serrano, Ana Aouicha, Mohamed Ben Agirre, Eneko Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title	Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title_full	Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title_fullStr	Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title_full_unstemmed	Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title_short	Reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
title_sort	reproducibility dataset for a large experimental survey on word embeddings and ontology-based methods for word similarity
topic	Computer Science
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6736772/ https://www.ncbi.nlm.nih.gov/pubmed/31516953 http://dx.doi.org/10.1016/j.dib.2019.104432
work_keys_str_mv	AT lastradiazjuanj reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT goikoetxeajosu reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT hadjtaiebmohamedali reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT garciaserranoana reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT aouichamohamedben reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity AT agirreeneko reproducibilitydatasetforalargeexperimentalsurveyonwordembeddingsandontologybasedmethodsforwordsimilarity
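
The description above names Pearson correlation, Spearman correlation and a harmonic score as the evaluation metrics computed from the raw word-similarity files. The following is a minimal Python sketch of those three metrics, not the authors' R script. It assumes that the harmonic score is the harmonic mean of the Pearson and Spearman values; the record itself does not define it. The function names and the sample numbers are hypothetical and serve only as illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def harmonic_score(r, rho):
    """ASSUMPTION: harmonic mean of Pearson (r) and Spearman (rho)."""
    return 2 * r * rho / (r + rho)


# Hypothetical data: human judgments vs. values returned by one method
# for four word pairs (invented numbers, not taken from the dataset).
human = [3.92, 3.84, 0.42, 1.10]
method = [0.95, 0.90, 0.10, 0.30]

r = pearson(human, method)
rho = spearman(human, method)
h = harmonic_score(r, rho)
```

In this sketch, each benchmark would contribute one list of human judgments and one list of method outputs, aligned by word pair. The p-values mentioned in the description are not covered here.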

Similar Items