
Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering

BACKGROUND: Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using strin...


Bibliographic Details
Main Authors:	Yamaguchi, Atsuko, Yamamoto, Yasunori, Kim, Jin-Dong, Takagi, Toshihisa, Yonezawa, Akinori
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394426/ https://www.ncbi.nlm.nih.gov/pubmed/22759617 http://dx.doi.org/10.1186/1471-2164-13-S3-S8

_version_	1782237867607588864
author	Yamaguchi, Atsuko Yamamoto, Yasunori Kim, Jin-Dong Takagi, Toshihisa Yonezawa, Akinori
author_facet	Yamaguchi, Atsuko Yamamoto, Yasunori Kim, Jin-Dong Takagi, Toshihisa Yonezawa, Akinori
author_sort	Yamaguchi, Atsuko
collection	PubMed
description	BACKGROUND: Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names. RESULTS: Our experimental results revealed the following: (1) The edit distance had the best performance in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance would yield the best measure for matching pairs of terms for their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs for chemical names and for non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than the result obtained for non-chemical names, the results were contrary when using SoftTFIDF. (4) A suitable weight for chemical names varies substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distances improve further by dividing a set of full forms into two subsets, according to whether a full form is a chemical name or not. These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full form clustering by computing chemical names and non-chemical names separately. 
CONCLUSIONS: In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
format	Online Article Text
id	pubmed-3394426
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-33944262012-07-16 Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering Yamaguchi, Atsuko Yamamoto, Yasunori Kim, Jin-Dong Takagi, Toshihisa Yonezawa, Akinori BMC Genomics Proceedings BACKGROUND: Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names. RESULTS: Our experimental results revealed the following: (1) The edit distance had the best performance in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance would yield the best measure for matching pairs of terms for their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs for chemical names and for non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than the result obtained for non-chemical names, the results were contrary when using SoftTFIDF. (4) A suitable weight for chemical names varies substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distances improve further by dividing a set of full forms into two subsets, according to whether a full form is a chemical name or not. 
These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full form clustering by computing chemical names and non-chemical names separately. CONCLUSIONS: In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering. BioMed Central 2012-06-11 /pmc/articles/PMC3394426/ /pubmed/22759617 http://dx.doi.org/10.1186/1471-2164-13-S3-S8 Text en Copyright ©2012 Yamaguchi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Yamaguchi, Atsuko Yamamoto, Yasunori Kim, Jin-Dong Takagi, Toshihisa Yonezawa, Akinori Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering
title	Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering
title_sort	discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394426/ https://www.ncbi.nlm.nih.gov/pubmed/22759617 http://dx.doi.org/10.1186/1471-2164-13-S3-S8
work_keys_str_mv	AT yamaguchiatsuko discriminativeapplicationofstringsimilaritymethodstochemicalandnonchemicalnamesforbiomedicalabbreviationclustering AT yamamotoyasunori discriminativeapplicationofstringsimilaritymethodstochemicalandnonchemicalnamesforbiomedicalabbreviationclustering AT kimjindong discriminativeapplicationofstringsimilaritymethodstochemicalandnonchemicalnamesforbiomedicalabbreviationclustering AT takagitoshihisa discriminativeapplicationofstringsimilaritymethodstochemicalandnonchemicalnamesforbiomedicalabbreviationclustering AT yonezawaakinori discriminativeapplicationofstringsimilaritymethodstochemicalandnonchemicalnamesforbiomedicalabbreviationclustering
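
The abstract above compares string similarity measures and reports that the best matching threshold differs between chemical and non-chemical names. The following is a minimal Python sketch of that idea, not the authors' implementation: it computes a normalized edit-distance similarity and a bigram Dice coefficient, then applies a separate threshold per name class. The threshold values, the `THRESHOLDS` table, the `same_cluster` helper, and the `is_chemical` flag are all hypothetical placeholders; the record does not state the thresholds or how names were classified as chemical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dice_bigram(a: str, b: str) -> float:
    """Dice coefficient over multisets of character bigrams."""
    ca = Counter(a[i:i + 2] for i in range(len(a) - 1))
    cb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:  # both strings are shorter than two characters
        return 1.0 if a == b else 0.0
    return 2.0 * sum((ca & cb).values()) / total


# Placeholder thresholds (assumed for illustration, not taken from the paper).
THRESHOLDS = {True: 0.95, False: 0.80}  # key: is the full form a chemical name?


def same_cluster(a: str, b: str, is_chemical: bool, sim=edit_similarity) -> bool:
    """Decide whether two full forms match, using the class-specific threshold."""
    return sim(a.lower(), b.lower()) >= THRESHOLDS[is_chemical]
```

With these placeholder values, `same_cluster("tumour necrosis factor", "tumor necrosis factor", False)` accepts the spelling variant, while `same_cluster("1-propanol", "2-propanol", True)` keeps the two names apart even though they also differ by a single character. In practice the thresholds would be tuned separately on labeled chemical and non-chemical data.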


Similar Items