Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

BACKGROUND: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test...

Bibliographic Details
Main Authors:	Jimeno-Yepes, Antonio J; McInnes, Bridget T; Aronson, Alan R
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3123611/ https://www.ncbi.nlm.nih.gov/pubmed/21635749 http://dx.doi.org/10.1186/1471-2105-12-223

_version_	1782207003143176192
author	Jimeno-Yepes, Antonio J; McInnes, Bridget T; Aronson, Alan R
author_facet	Jimeno-Yepes, Antonio J; McInnes, Bridget T; Aronson, Alan R
author_sort	Jimeno-Yepes, Antonio J
collection	PubMed
description	BACKGROUND: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD. METHODS: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH headings to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS Concept Unique Identifier (CUI) linked to that MeSH heading, so each instance carries a CUI. We compare the characteristics of the MSH WSD data set to those of the previously existing NLM WSD data set. RESULTS: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 that are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance with the results these algorithms previously obtained on the pre-existing NLM WSD data set. We show that the knowledge-based methods achieve different results but keep their relative performance, except for the Journal Descriptor Indexing (JDI) method, whose performance is below that of the other methods. CONCLUSIONS: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically by reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
format	Online Article Text
id	pubmed-3123611
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3123611 2011-06-26 Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation Jimeno-Yepes, Antonio J; McInnes, Bridget T; Aronson, Alan R BMC Bioinformatics Research Article BioMed Central 2011-06-02 /pmc/articles/PMC3123611/ /pubmed/21635749 http://dx.doi.org/10.1186/1471-2105-12-223 Text en Copyright ©2011 Jimeno-Yepes et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3123611/ https://www.ncbi.nlm.nih.gov/pubmed/21635749 http://dx.doi.org/10.1186/1471-2105-12-223
work_keys_str_mv	AT jimenoyepesantonioj exploitingmeshindexinginmedlinetogenerateadatasetforwordsensedisambiguation AT mcinnesbridgett exploitingmeshindexinginmedlinetogenerateadatasetforwordsensedisambiguation AT aronsonalanr exploitingmeshindexinginmedlinetogenerateadatasetforwordsensedisambiguation
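
The construction procedure summarized in the abstract (screen for ambiguous terms with two or more MeSH-linked senses, keep citations where the term co-occurs with exactly one of the candidate MeSH headings, label the instance with the linked CUI, and cap instances per sense) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, the whole-word matching rule, and the toy citations are all hypothetical, and the CUI strings are placeholders rather than real UMLS identifiers.

```python
import re
from collections import defaultdict

MAX_PER_SENSE = 100  # cap per sense, as stated in the abstract


def find_ambiguous_terms(term_to_senses):
    """Keep terms whose candidate senses map to two or more MeSH headings.

    term_to_senses: {term: {cui: mesh_heading}} (hypothetical layout).
    """
    return {t: s for t, s in term_to_senses.items() if len(s) >= 2}


def build_instances(ambiguous, citations, max_per_sense=MAX_PER_SENSE):
    """Label citations in which a term co-occurs with exactly one sense heading.

    citations: iterable of {"pmid": str, "text": str, "mesh": [headings]}.
    Returns {(term, cui): [pmid, ...]}.
    """
    instances = defaultdict(list)
    for cit in citations:
        headings = set(cit["mesh"])
        for term, senses in ambiguous.items():
            # Assumption: case-insensitive whole-word match of the term.
            pattern = r"\b" + re.escape(term) + r"\b"
            if not re.search(pattern, cit["text"], re.IGNORECASE):
                continue
            matched = [cui for cui, h in senses.items() if h in headings]
            if len(matched) != 1:
                continue  # zero or several candidate headings: skip
            key = (term, matched[0])
            if len(instances[key]) < max_per_sense:
                instances[key].append(cit["pmid"])
    return dict(instances)


# Toy data; CUI_1/CUI_2/CUI_3 are placeholders, not real UMLS CUIs.
term_to_senses = {
    "cold": {"CUI_1": "Cold Temperature", "CUI_2": "Common Cold"},
    "fever": {"CUI_3": "Fever"},  # single sense, so not ambiguous
}
citations = [
    {"pmid": "1", "text": "Patients with a cold were treated.",
     "mesh": ["Common Cold", "Humans"]},
    {"pmid": "2", "text": "Cold exposure in mice.",
     "mesh": ["Cold Temperature"]},
    {"pmid": "3", "text": "Cold weather and the common cold.",
     "mesh": ["Cold Temperature", "Common Cold"]},  # both senses: skipped
    {"pmid": "4", "text": "Scolding behaviour.",
     "mesh": ["Common Cold"]},  # no whole-word match: skipped
]

ambiguous = find_ambiguous_terms(term_to_senses)
instances = build_instances(ambiguous, citations)
```

Requiring exactly one matching heading is what lets the manual MeSH indexing act as the sense label: a citation indexed with both candidate headings gives no unambiguous signal and is dropped in this sketch.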
