
Synonym extraction and abbreviation expansion with ensembles of semantic spaces

BACKGROUND: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual constr...


Bibliographic Details
Main Authors: Henriksson, Aron, Moen, Hans, Skeppstedt, Maria, Daudaravičius, Vidas, Duneld, Martin
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937097/
https://www.ncbi.nlm.nih.gov/pubmed/24499679
http://dx.doi.org/10.1186/2041-1480-5-6
_version_ 1782305428184498176
author Henriksson, Aron
Moen, Hans
Skeppstedt, Maria
Daudaravičius, Vidas
Duneld, Martin
author_facet Henriksson, Aron
Moen, Hans
Skeppstedt, Maria
Daudaravičius, Vidas
Duneld, Martin
author_sort Henriksson, Aron
collection PubMed
description BACKGROUND: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. RESULTS: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms. 
CONCLUSIONS: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
format Online
Article
Text
id pubmed-3937097
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-39370972014-03-06 Synonym extraction and abbreviation expansion with ensembles of semantic spaces Henriksson, Aron Moen, Hans Skeppstedt, Maria Daudaravičius, Vidas Duneld, Martin J Biomed Semantics Research BACKGROUND: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. RESULTS: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. 
The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms. CONCLUSIONS: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks. BioMed Central 2014-02-05 /pmc/articles/PMC3937097/ /pubmed/24499679 http://dx.doi.org/10.1186/2041-1480-5-6 Text en Copyright © 2014 Henriksson et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Henriksson, Aron
Moen, Hans
Skeppstedt, Maria
Daudaravičius, Vidas
Duneld, Martin
Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title_full Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title_fullStr Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title_full_unstemmed Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title_short Synonym extraction and abbreviation expansion with ensembles of semantic spaces
title_sort synonym extraction and abbreviation expansion with ensembles of semantic spaces
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937097/
https://www.ncbi.nlm.nih.gov/pubmed/24499679
http://dx.doi.org/10.1186/2041-1480-5-6
work_keys_str_mv AT henrikssonaron synonymextractionandabbreviationexpansionwithensemblesofsemanticspaces
AT moenhans synonymextractionandabbreviationexpansionwithensemblesofsemanticspaces
AT skeppstedtmaria synonymextractionandabbreviationexpansionwithensemblesofsemanticspaces
AT daudaraviciusvidas synonymextractionandabbreviationexpansionwithensemblesofsemanticspaces
AT duneldmartin synonymextractionandabbreviationexpansionwithensemblesofsemanticspaces
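
The abstract reports that summing the cosine similarity scores of candidate terms across semantic spaces was generally the most profitable combination strategy, and that results were measured as recall in a list of ten candidate terms. The following is a minimal sketch of that idea, not the authors' implementation: the toy vectors, the dictionary-based representation of a semantic space, and the function names `cosine`, `rank_candidates`, and `recall_at_k` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, spaces, k=10):
    """Rank candidate terms by the SUM of their cosine similarity to the
    query term over all semantic spaces that contain the query.

    Each space is a hypothetical dict mapping term -> context vector;
    spaces may have different dimensionality and vocabulary.
    """
    totals = defaultdict(float)
    for space in spaces:
        if query not in space:
            continue
        query_vec = space[query]
        for term, vec in space.items():
            if term != query:
                totals[term] += cosine(query_vec, vec)
    # Highest summed score first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]


def recall_at_k(ranked, gold):
    """Fraction of gold-standard terms that appear in the ranked list."""
    if not gold:
        return 0.0
    return len(set(ranked) & set(gold)) / len(gold)


# Toy, made-up spaces standing in for a clinical corpus and a journal corpus.
clinical_space = {
    "mi": [1.0, 0.0],
    "myocardial infarction": [0.9, 0.1],
    "fracture": [0.0, 1.0],
}
journal_space = {
    "mi": [0.0, 1.0, 0.0],
    "myocardial infarction": [0.1, 0.9, 0.0],
    "heart attack": [0.0, 0.8, 0.2],
    "fracture": [1.0, 0.0, 0.0],
}

top_terms = rank_candidates("mi", [clinical_space, journal_space], k=2)
score = recall_at_k(top_terms, {"myocardial infarction", "heart attack"})
```

Because each space contributes its own cosine score independently, the spaces do not need to share dimensionality or vocabulary; a term missing from one space simply receives no contribution from it.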