Cargando…

Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues

BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities,...

Descripción completa

Detalles Bibliográficos
Autores principales:	Xu, Hua, Markatou, Marianthi, Dimova, Rositsa, Liu, Hongfang, Friedman, Carol
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2006
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1550263/ https://www.ncbi.nlm.nih.gov/pubmed/16822321 http://dx.doi.org/10.1186/1471-2105-7-334

_version_	1782129215319048192
author	Xu, Hua Markatou, Marianthi Dimova, Rositsa Liu, Hongfang Friedman, Carol
author_facet	Xu, Hua Markatou, Marianthi Dimova, Rositsa Liu, Hongfang Friedman, Carol
author_sort	Xu, Hua
collection	PubMed
description	BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance. RESULTS: Experiments were designed to measure the effect of "sample size" (i.e. size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated. CONCLUSION: Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizablility and the limitations of the methodology.
format	Text
id	pubmed-1550263
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-15502632006-08-19 Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues Xu, Hua Markatou, Marianthi Dimova, Rositsa Liu, Hongfang Friedman, Carol BMC Bioinformatics Research Article BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance. RESULTS: Experiments were designed to measure the effect of "sample size" (i.e. size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated. CONCLUSION: Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizablility and the limitations of the methodology. BioMed Central 2006-07-05 /pmc/articles/PMC1550263/ /pubmed/16822321 http://dx.doi.org/10.1186/1471-2105-7-334 Text en Copyright © 2006 Xu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Xu, Hua Markatou, Marianthi Dimova, Rositsa Liu, Hongfang Friedman, Carol Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title	Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title_full	Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title_fullStr	Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title_full_unstemmed	Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title_short	Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
title_sort	machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1550263/ https://www.ncbi.nlm.nih.gov/pubmed/16822321 http://dx.doi.org/10.1186/1471-2105-7-334
work_keys_str_mv	AT xuhua machinelearningandwordsensedisambiguationinthebiomedicaldomaindesignandevaluationissues AT markatoumarianthi machinelearningandwordsensedisambiguationinthebiomedicaldomaindesignandevaluationissues AT dimovarositsa machinelearningandwordsensedisambiguationinthebiomedicaldomaindesignandevaluationissues AT liuhongfang machinelearningandwordsensedisambiguationinthebiomedicaldomaindesignandevaluationissues AT friedmancarol machinelearningandwordsensedisambiguationinthebiomedicaldomaindesignandevaluationissues

Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues

Ejemplares similares