Similarity corpus on microbial transcriptional regulation

BACKGROUND: The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction w...

Bibliographic Details
Main Authors:	Lithgow-Serrano, Oscar, Gama-Castro, Socorro, Ishida-Gutiérrez, Cecilia, Mejía-Almonte, Citlalli, Tierrafría, Víctor H., Martínez-Luna, Sara, Santos-Zavaleta, Alberto, Velázquez-Ramírez, David, Collado-Vides, Julio
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532127/ https://www.ncbi.nlm.nih.gov/pubmed/31118102 http://dx.doi.org/10.1186/s13326-019-0200-x

_version_	1783420945760780288
author	Lithgow-Serrano, Oscar Gama-Castro, Socorro Ishida-Gutiérrez, Cecilia Mejía-Almonte, Citlalli Tierrafría, Víctor H. Martínez-Luna, Sara Santos-Zavaleta, Alberto Velázquez-Ramírez, David Collado-Vides, Julio
author_facet	Lithgow-Serrano, Oscar Gama-Castro, Socorro Ishida-Gutiérrez, Cecilia Mejía-Almonte, Citlalli Tierrafría, Víctor H. Martínez-Luna, Sara Santos-Zavaleta, Alberto Velázquez-Ramírez, David Collado-Vides, Julio
author_sort	Lithgow-Serrano, Oscar
collection	PubMed
description	BACKGROUND: The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource. RESULTS: Given our interest in applying such approaches to the curation of the biomedical literature, specifically the literature on gene regulation in microbial organisms, we built a corpus with graded textual similarity, evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied: each pair of sentences was evaluated by 3 annotators from a total of 7. The scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions, with clear improvements in the agreed guidelines and interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed. CONCLUSIONS: To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.
format	Online Article Text
id	pubmed-6532127
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6532127 2019-05-28 Similarity corpus on microbial transcriptional regulation J Biomed Semantics Research BioMed Central 2019-05-22 /pmc/articles/PMC6532127/ /pubmed/31118102 http://dx.doi.org/10.1186/s13326-019-0200-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Lithgow-Serrano, Oscar Gama-Castro, Socorro Ishida-Gutiérrez, Cecilia Mejía-Almonte, Citlalli Tierrafría, Víctor H. Martínez-Luna, Sara Santos-Zavaleta, Alberto Velázquez-Ramírez, David Collado-Vides, Julio Similarity corpus on microbial transcriptional regulation
title	Similarity corpus on microbial transcriptional regulation
title_full	Similarity corpus on microbial transcriptional regulation
title_fullStr	Similarity corpus on microbial transcriptional regulation
title_full_unstemmed	Similarity corpus on microbial transcriptional regulation
title_short	Similarity corpus on microbial transcriptional regulation
title_sort	similarity corpus on microbial transcriptional regulation
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532127/ https://www.ncbi.nlm.nih.gov/pubmed/31118102 http://dx.doi.org/10.1186/s13326-019-0200-x
work_keys_str_mv	AT lithgowserranooscar similaritycorpusonmicrobialtranscriptionalregulation AT gamacastrosocorro similaritycorpusonmicrobialtranscriptionalregulation AT ishidagutierrezcecilia similaritycorpusonmicrobialtranscriptionalregulation AT mejiaalmontecitlalli similaritycorpusonmicrobialtranscriptionalregulation AT tierrafriavictorh similaritycorpusonmicrobialtranscriptionalregulation AT martinezlunasara similaritycorpusonmicrobialtranscriptionalregulation AT santoszavaletaalberto similaritycorpusonmicrobialtranscriptionalregulation AT velazquezramirezdavid similaritycorpusonmicrobialtranscriptionalregulation AT colladovidesjulio similaritycorpusonmicrobialtranscriptionalregulation
