
Expansion of medical vocabularies using distributional semantics on Japanese patient blogs

BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from...


Bibliographic Details
Main Authors:	Ahltorp, Magnus; Skeppstedt, Maria; Kitajima, Shiho; Henriksson, Aron; Rzepka, Rafal; Araki, Kenji
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5037651/ https://www.ncbi.nlm.nih.gov/pubmed/27671202 http://dx.doi.org/10.1186/s13326-016-0093-x

_version_	1782455783035764736
author	Ahltorp, Magnus; Skeppstedt, Maria; Kitajima, Shiho; Henriksson, Aron; Rzepka, Rafal; Araki, Kenji
author_facet	Ahltorp, Magnus; Skeppstedt, Maria; Kitajima, Shiho; Henriksson, Aron; Rzepka, Rafal; Araki, Kenji
author_sort	Ahltorp, Magnus
collection	PubMed
description	BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs. METHODS: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies. RESULTS: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding. CONCLUSIONS: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
format	Online Article Text
id	pubmed-5037651
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-50376512016-10-05 Expansion of medical vocabularies using distributional semantics on Japanese patient blogs Ahltorp, Magnus Skeppstedt, Maria Kitajima, Shiho Henriksson, Aron Rzepka, Rafal Araki, Kenji J Biomed Semantics Research BACKGROUND: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to a limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, essential also for text mining from corpora written in other languages than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthographics, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs. METHODS: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 ×100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies. RESULTS: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. 
For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding. CONCLUSIONS: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion. BioMed Central 2016-09-26 /pmc/articles/PMC5037651/ /pubmed/27671202 http://dx.doi.org/10.1186/s13326-016-0093-x Text en © Ahltorp et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Ahltorp, Magnus Skeppstedt, Maria Kitajima, Shiho Henriksson, Aron Rzepka, Rafal Araki, Kenji Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title	Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title_full	Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title_fullStr	Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title_full_unstemmed	Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title_short	Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
title_sort	expansion of medical vocabularies using distributional semantics on japanese patient blogs
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5037651/ https://www.ncbi.nlm.nih.gov/pubmed/27671202 http://dx.doi.org/10.1186/s13326-016-0093-x
work_keys_str_mv	AT ahltorpmagnus expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs AT skeppstedtmaria expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs AT kitajimashiho expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs AT henrikssonaron expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs AT rzepkarafal expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs AT arakikenji expansionofmedicalvocabulariesusingdistributionalsemanticsonjapanesepatientblogs
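
Illustrative note: the pipeline summarised in the abstract (random indexing, ranking unknown terms by closeness to seed-cluster centroids, recall at a cut-off) can be sketched roughly as below. This is a minimal standard-library sketch and not the authors' implementation. The names DIM, NONZERO, context_vectors, rank_candidates and recall_at are hypothetical, the vector settings are assumed values, the agglomerative clustering step is replaced by clusters given by hand, and English toy tokens stand in for tokenised Japanese blog text.

```python
import math
import random
from collections import defaultdict

DIM = 200      # vector dimensionality (assumed value, not from the record)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)


def index_vector(rng):
    """Sparse ternary index vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def context_vectors(sentences, window=1, seed=0):
    """Random indexing: add up the index vectors of neighbouring tokens.

    window=1 corresponds to the 1+1 context window named in the abstract.
    """
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng)
        for i, tok in enumerate(tokens):
            ctx = context[tok]
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for d, value in enumerate(index[tokens[j]]):
                        ctx[d] += value
    return dict(context)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_candidates(context, clusters):
    """Rank terms outside the seed clusters by their closest centroid."""
    known = {term for cluster in clusters for term in cluster}
    centroids = [centroid([context[t] for t in cluster]) for cluster in clusters]
    scores = {
        term: max(cosine(vec, c) for c in centroids)
        for term, vec in context.items()
        if term not in known
    }
    return sorted(scores, key=scores.get, reverse=True)


def recall_at(candidates, held_out, k):
    """Share of held-out vocabulary terms found among the top-k candidates."""
    return len(set(candidates[:k]) & set(held_out)) / len(held_out)


# Toy usage: two body-part seeds in one hand-made cluster, "knee" held out.
sentences = [
    ["head", "hurts"], ["stomach", "hurts"], ["knee", "hurts"],
    ["took", "aspirin"], ["took", "ibuprofen"],
]
context = context_vectors(sentences)
ranked = rank_candidates(context, clusters=[["head", "stomach"]])
print(ranked)
print(recall_at(ranked, held_out=["knee"], k=1))
```

In the study's terms, k would be set to n or 10n, where n is the number of held-out vocabulary terms. A real run would also need a Japanese tokeniser and a clustering step over the seed terms, both of which this sketch leaves out.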

