Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts
Main authors: | Duan, Weisi; Song, Min; Yates, Alexander |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665052/ https://www.ncbi.nlm.nih.gov/pubmed/19344480 http://dx.doi.org/10.1186/1471-2105-10-S3-S4 |
_version_ | 1782166016032243712 |
---|---|
author | Duan, Weisi Song, Min Yates, Alexander |
collection | PubMed |
description | BACKGROUND: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. METHODS: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters. RESULTS: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system. CONCLUSION: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated. |
format | Text |
id | pubmed-2665052 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2665052 2009-04-06 Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts Duan, Weisi Song, Min Yates, Alexander BMC Bioinformatics Proceedings BioMed Central 2009-03-19 /pmc/articles/PMC2665052/ /pubmed/19344480 http://dx.doi.org/10.1186/1471-2105-10-S3-S4 Text en Copyright © 2009 Duan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts |
topic | Proceedings |
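
Note on the clustering objective: the description states that the algorithm seeks a split of the data that maximizes the minimum distance between pairs of points in different clusters. The record does not give the authors' graph-based algorithm, so the sketch below is not their method. It is a minimal illustration of the same objective using a standard technique: merging closest pairs Kruskal-style until two components remain, which yields the two-way split with the largest such margin. The function name `max_margin_split`, the use of Euclidean distance, and the toy points are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def max_margin_split(points):
    """Split points into two clusters so that the minimum distance between
    points in different clusters (the margin) is as large as possible.

    Hypothetical helper: a Kruskal-style single-linkage sketch, not the
    algorithm from the catalogued paper.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")

    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        # Find the component root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    margin = None
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both endpoints already in the same component
        if components == 2:
            # Shortest edge joining the two remaining components.
            margin = d
            break
        parent[ri] = rj
        components -= 1

    root0 = find(0)
    labels = [0 if find(i) == root0 else 1 for i in range(n)]
    return labels, margin


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    labels, margin = max_margin_split(toy)
    print(labels, round(margin, 3))
```

This brute-force version sorts all pairs of points, so it suits only small inputs; the efficiency claimed in the description refers to the authors' own algorithm, which this sketch does not reproduce.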