
Generalising semantic category disambiguation with large lexical resources for fun and profit

BACKGROUND: Semantic Category Disambiguation (SCD) is the task of assigning the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example Protein to “Fibrin”. SCD is relevant to Natural Language Processing tasks such as Named Entity Recognition, coref...


Bibliographic Details
Main Authors:	Stenetorp, Pontus; Pyysalo, Sampo; Ananiadou, Sophia; Tsujii, Jun’ichi
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4107982/ https://www.ncbi.nlm.nih.gov/pubmed/25093067 http://dx.doi.org/10.1186/2041-1480-5-26

author	Stenetorp, Pontus; Pyysalo, Sampo; Ananiadou, Sophia; Tsujii, Jun’ichi
collection	PubMed
description	BACKGROUND: Semantic Category Disambiguation (SCD) is the task of assigning the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example Protein to “Fibrin”. SCD is relevant to Natural Language Processing tasks such as Named Entity Recognition, coreference resolution and coordination resolution. In this work, we study machine learning-based SCD methods using large lexical resources and approximate string matching, aiming to generalise these methods with regard to domains, lexical resources and the composition of data sets. We specifically consider the applicability of SCD for the purposes of supporting human annotators and acting as a pipeline component for other Natural Language Processing systems.
RESULTS: While previous research has mostly cast SCD purely as a classification task, we consider a task setting that allows for multiple semantic categories to be suggested, aiming to minimise the number of suggestions while maintaining high recall. We argue that this setting reflects aspects which are essential for both a pipeline component and when supporting human annotators. We introduce an SCD method based on a recently introduced machine learning-based system and evaluate it on 15 corpora covering biomedical, clinical and newswire texts and ranging in the number of semantic categories from 2 to 91. With appropriate settings, our system maintains an average recall of 99% while reducing the number of candidate semantic categories on average by 65% over all data sets.
CONCLUSIONS: Machine learning-based SCD using large lexical resources and approximate string matching is sensitive to the selection and granularity of lexical resources, but generalises well to a wide range of text domains and data sets given appropriate resources and parameter settings. By substantially reducing the number of candidate categories while only very rarely excluding the correct one, our method is shown to be applicable to manual annotation support tasks and use as a high-recall component in text processing pipelines. The introduced system and all related resources are freely available for research purposes at: https://github.com/ninjin/simsem.
format	Online Article Text
id	pubmed-4107982
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4107982 2014-08-04 Generalising semantic category disambiguation with large lexical resources for fun and profit. Stenetorp, Pontus; Pyysalo, Sampo; Ananiadou, Sophia; Tsujii, Jun’ichi. J Biomed Semantics. Research. BioMed Central 2014-06-02. /pmc/articles/PMC4107982/ /pubmed/25093067 http://dx.doi.org/10.1186/2041-1480-5-26 Text en. Copyright © 2014 Stenetorp et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
title	Generalising semantic category disambiguation with large lexical resources for fun and profit
topic	Research