Creating a medical dictionary using word alignment: The influence of sources and resources
BACKGROUND: Automatic word alignment of parallel texts with the same content in different languages is among other things used to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automati...
Main authors: | Nyström, Mikael; Merkel, Magnus; Petersson, Håkan; Åhlfeldt, Hans |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267171/ https://www.ncbi.nlm.nih.gov/pubmed/18036221 http://dx.doi.org/10.1186/1472-6947-7-37 |
_version_ | 1782151616323911680 |
---|---|
author | Nyström, Mikael Merkel, Magnus Petersson, Håkan Åhlfeldt, Hans |
author_facet | Nyström, Mikael Merkel, Magnus Petersson, Håkan Åhlfeldt, Hans |
author_sort | Nyström, Mikael |
collection | PubMed |
description | BACKGROUND: Automatic word alignment of parallel texts with the same content in different languages is among other things used to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, ICF, NCSP, KSH97-P and parts of MeSH and how the terminology systems and type of resources influence the quality. METHODS: We automatically word aligned the terminology systems using static resources, like dictionaries, statistical resources, like statistically derived dictionaries, and training resources, which were generated from manual word alignment. We varied which part of the terminology systems that we used to generate the resources, which parts that we word aligned and which types of resources we used in the alignment process to explore the influence the different terminology systems and resources have on the recall and precision. After the analysis, we used the best configuration of the automatic word alignment for generation of candidate term pairs. We then manually verified the candidate term pairs and included the correct pairs in an English-Swedish dictionary. RESULTS: The results indicate that more resources and resource types give better results but the size of the parts used to generate the resources only partly affects the quality. The most generally useful resources were generated from ICD-10 and resources generated from MeSH were not as general as other resources. Systematic inter-language differences in the structure of the terminology system rubrics make the rubrics harder to align. Manually created training resources give nearly as good results as a union of static resources, statistical resources and training resources and noticeably better results than a union of static resources and statistical resources. The verified English-Swedish dictionary contains 24,000 term pairs in base forms. CONCLUSION: More resources give better results in the automatic word alignment, but some resources only give small improvements. The most important type of resource is training and the most general resources were generated from ICD-10. |
format | Text |
id | pubmed-2267171 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2267171 2008-03-13 Creating a medical dictionary using word alignment: The influence of sources and resources Nyström, Mikael Merkel, Magnus Petersson, Håkan Åhlfeldt, Hans BMC Med Inform Decis Mak Research Article BACKGROUND: Automatic word alignment of parallel texts with the same content in different languages is among other things used to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, ICF, NCSP, KSH97-P and parts of MeSH and how the terminology systems and type of resources influence the quality. METHODS: We automatically word aligned the terminology systems using static resources, like dictionaries, statistical resources, like statistically derived dictionaries, and training resources, which were generated from manual word alignment. We varied which part of the terminology systems that we used to generate the resources, which parts that we word aligned and which types of resources we used in the alignment process to explore the influence the different terminology systems and resources have on the recall and precision. After the analysis, we used the best configuration of the automatic word alignment for generation of candidate term pairs. We then manually verified the candidate term pairs and included the correct pairs in an English-Swedish dictionary. RESULTS: The results indicate that more resources and resource types give better results but the size of the parts used to generate the resources only partly affects the quality. The most generally useful resources were generated from ICD-10 and resources generated from MeSH were not as general as other resources. Systematic inter-language differences in the structure of the terminology system rubrics make the rubrics harder to align. Manually created training resources give nearly as good results as a union of static resources, statistical resources and training resources and noticeably better results than a union of static resources and statistical resources. The verified English-Swedish dictionary contains 24,000 term pairs in base forms. CONCLUSION: More resources give better results in the automatic word alignment, but some resources only give small improvements. The most important type of resource is training and the most general resources were generated from ICD-10. BioMed Central 2007-11-23 /pmc/articles/PMC2267171/ /pubmed/18036221 http://dx.doi.org/10.1186/1472-6947-7-37 Text en Copyright © 2007 Nyström et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Nyström, Mikael Merkel, Magnus Petersson, Håkan Åhlfeldt, Hans Creating a medical dictionary using word alignment: The influence of sources and resources |
title | Creating a medical dictionary using word alignment: The influence of sources and resources |
title_full | Creating a medical dictionary using word alignment: The influence of sources and resources |
title_fullStr | Creating a medical dictionary using word alignment: The influence of sources and resources |
title_full_unstemmed | Creating a medical dictionary using word alignment: The influence of sources and resources |
title_short | Creating a medical dictionary using word alignment: The influence of sources and resources |
title_sort | creating a medical dictionary using word alignment: the influence of sources and resources |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267171/ https://www.ncbi.nlm.nih.gov/pubmed/18036221 http://dx.doi.org/10.1186/1472-6947-7-37 |
work_keys_str_mv | AT nystrommikael creatingamedicaldictionaryusingwordalignmenttheinfluenceofsourcesandresources AT merkelmagnus creatingamedicaldictionaryusingwordalignmenttheinfluenceofsourcesandresources AT peterssonhakan creatingamedicaldictionaryusingwordalignmenttheinfluenceofsourcesandresources AT ahlfeldthans creatingamedicaldictionaryusingwordalignmenttheinfluenceofsourcesandresources |
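
The abstract in this record compares alignment configurations by recall and precision. The following is a minimal sketch of how those two measures could be computed for a set of word alignment links, assuming each link is a pair of token positions (English index, Swedish index). The function name, the link representation, and the example data are hypothetical and are not taken from the article, which may score links differently (for example, by giving credit for partial links).

```python
def alignment_precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of alignment links.

    Each link is assumed to be a hashable pair such as
    (english_token_index, swedish_token_index).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links proposed by the aligner that are in the gold standard
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not from the article.
gold_links = {(0, 0), (1, 1), (2, 3)}
predicted_links = {(0, 0), (1, 1), (2, 2), (3, 3)}

precision, recall = alignment_precision_recall(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch, precision is the share of proposed links that are correct and recall is the share of gold-standard links that were found, so an aligner that proposes many links tends to gain recall at the cost of precision.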