Synonym set extraction from the biomedical literature by lexical pattern discovery

BACKGROUND: Although there are a large number of thesauri for the biomedical domain, many of them lack coverage in terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such pattern...

Bibliographic Details
Main Authors: McCrae, John, Collier, Nigel
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2335115/
https://www.ncbi.nlm.nih.gov/pubmed/18366721
http://dx.doi.org/10.1186/1471-2105-9-159
author McCrae, John
Collier, Nigel
collection PubMed
description BACKGROUND: Although there are a large number of thesauri for the biomedical domain, many of them lack coverage in terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular, it is not certain which patterns are useful for capturing synonymy. The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactical analysis. Finally, to give a more consistent and applicable result, it is desirable to use these patterns to form synonym sets in a sound way. RESULTS: We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classification of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. We achieved 73.2% precision and 29.7% recall with our method, outperforming hand-made resources such as MeSH and Wikipedia. CONCLUSION: We conclude that automatic methods can play a practical role in developing new thesauri or expanding on existing ones, and this can be done with only a small amount of training data and no need for resources such as parsers. We also conclude that the accuracy can be improved by grouping terms into synonym sets.
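The feature-vector step described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three patterns are hand-written placeholders (the paper discovers its patterns automatically by heuristic search from seed patterns), and the names PATTERNS and feature_vector are hypothetical. The classifier and the synonym-set grouping are not shown.

```python
import re

# Hypothetical surface patterns, NOT taken from the paper.
# {a} and {b} are slots for the two candidate terms.
PATTERNS = [
    r"{a} \(also known as {b}\)",
    r"{a},? also called {b}",
    r"{a} or {b}",
]


def feature_vector(term_a, term_b, sentences):
    """Return one 0/1 feature per pattern: 1 if the pattern, with both
    terms substituted in, matches at least one sentence."""
    vector = []
    for template in PATTERNS:
        # Escape the terms so they are matched literally.
        regex = re.compile(
            template.format(a=re.escape(term_a), b=re.escape(term_b)),
            re.IGNORECASE,
        )
        vector.append(int(any(regex.search(s) for s in sentences)))
    return vector


sentences = [
    "Myocardial infarction (also known as heart attack) is common.",
    "Patients with fever or chills were excluded.",
]
print(feature_vector("myocardial infarction", "heart attack", sentences))
print(feature_vector("fever", "chills", sentences))
```

In this toy corpus the non-synonymous pair "fever" and "chills" still fires the third pattern, which suggests why a classifier trained over many pattern features is preferable to trusting any single pattern.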
format Text
id pubmed-2335115
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2335115 2008-04-25 Synonym set extraction from the biomedical literature by lexical pattern discovery McCrae, John Collier, Nigel BMC Bioinformatics Research Article
BioMed Central 2008-03-24 /pmc/articles/PMC2335115/ /pubmed/18366721 http://dx.doi.org/10.1186/1471-2105-9-159 Text en Copyright © 2008 McCrae and Collier; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Synonym set extraction from the biomedical literature by lexical pattern discovery
topic Research Article