Protocol for a reproducible experimental survey on biomedical sentence similarity

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in...

Bibliographic Details
Main authors:	Lara-Clares, Alicia; Lastra-Díaz, Juan J.; Garcia-Serrano, Ana
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Registered Report Protocol
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7990182/ https://www.ncbi.nlm.nih.gov/pubmed/33760855 http://dx.doi.org/10.1371/journal.pone.0248663

author	Lara-Clares, Alicia; Lastra-Díaz, Juan J.; Garcia-Serrano, Ana
collection	PubMed
description	Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, sentence similarity methods for the biomedical domain have attracted considerable attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for reasons that include the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly established. In addition, the literature on biomedical sentence similarity has other significant gaps: (1) several unexplored sentence similarity methods that deserve to be evaluated; (2) an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR), that has yet to be evaluated; (3) the absence of a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of sentence similarity methods; and (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. To address these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date and, for the first time, reproducible experimental survey on biomedical sentence similarity. This experimental survey will be based on our own software replication and on the evaluation of all methods under study on the same software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
format	Online Article Text
id	pubmed-7990182
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7990182 2021-04-05 Protocol for a reproducible experimental survey on biomedical sentence similarity Lara-Clares, Alicia; Lastra-Díaz, Juan J.; Garcia-Serrano, Ana PLoS One Registered Report Protocol Public Library of Science 2021-03-24 /pmc/articles/PMC7990182/ /pubmed/33760855 http://dx.doi.org/10.1371/journal.pone.0248663 Text en © 2021 Lara-Clares et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Protocol for a reproducible experimental survey on biomedical sentence similarity
topic	Registered Report Protocol
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7990182/ https://www.ncbi.nlm.nih.gov/pubmed/33760855 http://dx.doi.org/10.1371/journal.pone.0248663

Similar Items