Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information

Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks like word sense disambiguation or machine translation. The task of calculating semantic similarity is usually presented in the form of datasets which con...

Bibliographic Details
Main Authors:	Netisopakul, Ponrudee; Wohlgenannt, Gerhard; Pulich, Aleksei; Hlaing, Zar Zar
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7888635/ https://www.ncbi.nlm.nih.gov/pubmed/33596220 http://dx.doi.org/10.1371/journal.pone.0246751

_version_	1783652198936215552
author	Netisopakul, Ponrudee Wohlgenannt, Gerhard Pulich, Aleksei Hlaing, Zar Zar
author_sort	Netisopakul, Ponrudee
collection	PubMed
description	Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks such as word sense disambiguation and machine translation. The task of calculating semantic similarity is usually presented in the form of datasets that contain word pairs and a human-assigned similarity score. Algorithms are then evaluated by their ability to approximate the gold-standard similarity scores. Many such datasets, with different characteristics, have been created for the English language. Recently, four of those were transformed into Thai-language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve the previous baseline evaluations for Thai semantic similarity and to address the challenges of unsegmented Asian languages (particularly the high fraction of out-of-vocabulary (OOV) dataset terms). To this end we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources like WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we managed to raise the state-of-the-art, measured with the harmonic mean of Pearson and Spearman ρ, by a large margin: from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.
format	Online Article Text
id	pubmed-7888635
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7888635 2021-02-25 Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information Netisopakul, Ponrudee Wohlgenannt, Gerhard Pulich, Aleksei Hlaing, Zar Zar PLoS One Research Article
Public Library of Science 2021-02-17 /pmc/articles/PMC7888635/ /pubmed/33596220 http://dx.doi.org/10.1371/journal.pone.0246751 Text en © 2021 Netisopakul et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information
topic	Research Article
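
The abstract reports results as the harmonic mean of the Pearson and Spearman correlations between model scores and gold-standard scores. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names and the toy scores are hypothetical, ties are given average ranks, and both correlations are assumed to be positive so that the harmonic mean is well defined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers (assumption: a, b > 0)."""
    return 2 * a * b / (a + b)


# Hypothetical toy data: human-assigned scores vs. model similarities.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.2, 0.3, 0.4, 0.9]

score = harmonic_mean(pearson(gold, predicted), spearman(gold, predicted))
print(f"harmonic mean of Pearson and Spearman: {score:.3f}")
```

In the toy data the predicted scores preserve the gold ordering exactly, so Spearman ρ is 1 while Pearson is lower because the relationship is not linear; the harmonic mean falls between the two and is pulled toward the smaller value.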
