Terminology extraction from medical texts in Polish
BACKGROUND: Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consisten...
Main Authors: | Marciniak, Małgorzata, Mykowiecka, Agnieszka |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062289/ https://www.ncbi.nlm.nih.gov/pubmed/24976943 http://dx.doi.org/10.1186/2041-1480-5-24 |
_version_ | 1782321624079400960 |
---|---|
author | Marciniak, Małgorzata Mykowiecka, Agnieszka |
author_facet | Marciniak, Małgorzata Mykowiecka, Agnieszka |
author_sort | Marciniak, Małgorzata |
collection | PubMed |
description | BACKGROUND: Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consistency of hospital documentation and obtaining statistical data. To perform this task we need information on the phrases we are looking for. At the moment, clinical Polish resources are sparse. The existing terminologies, such as Polish Medical Subject Headings (MeSH), do not provide sufficient coverage for clinical tasks. It would be helpful therefore if it were possible to automatically prepare, on the basis of a data sample, an initial set of terms which, after manual verification, could be used for the purpose of information extraction. RESULTS: Using a combination of linguistic and statistical methods for processing over 1200 children hospital discharge records, we obtained a list of single and multiword terms used in hospital discharge documents written in Polish. The phrases are ordered according to their presumed importance in domain texts measured by the frequency of use of a phrase and the variety of its contexts. The evaluation showed that the automatically identified phrases cover about 84% of terms in domain texts. At the top of the ranked list, only 4% out of 400 terms were incorrect while out of the final 200, 20% of expressions were either not domain related or syntactically incorrect. We also observed that 70% of the obtained terms are not included in the Polish MeSH. CONCLUSIONS: Automatic terminology extraction can give results which are of a quality high enough to be taken as a starting point for building domain related terminological dictionaries or ontologies. This approach can be useful for preparing terminological resources for very specific subdomains for which no relevant terminologies already exist. The evaluation performed showed that none of the tested ranking procedures were able to filter out all improperly constructed noun phrases from the top of the list. Careful choice of noun phrases is crucial to the usefulness of the created terminological resource in applications such as lexicon construction or acquisition of semantic relations from texts. |
format | Online Article Text |
id | pubmed-4062289 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-40622892014-06-27 Terminology extraction from medical texts in Polish Marciniak, Małgorzata Mykowiecka, Agnieszka J Biomed Semantics Research BACKGROUND: Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consistency of hospital documentation and obtaining statistical data. To perform this task we need information on the phrases we are looking for. At the moment, clinical Polish resources are sparse. The existing terminologies, such as Polish Medical Subject Headings (MeSH), do not provide sufficient coverage for clinical tasks. It would be helpful therefore if it were possible to automatically prepare, on the basis of a data sample, an initial set of terms which, after manual verification, could be used for the purpose of information extraction. RESULTS: Using a combination of linguistic and statistical methods for processing over 1200 children hospital discharge records, we obtained a list of single and multiword terms used in hospital discharge documents written in Polish. The phrases are ordered according to their presumed importance in domain texts measured by the frequency of use of a phrase and the variety of its contexts. The evaluation showed that the automatically identified phrases cover about 84% of terms in domain texts. At the top of the ranked list, only 4% out of 400 terms were incorrect while out of the final 200, 20% of expressions were either not domain related or syntactically incorrect. We also observed that 70% of the obtained terms are not included in the Polish MeSH. CONCLUSIONS: Automatic terminology extraction can give results which are of a quality high enough to be taken as a starting point for building domain related terminological dictionaries or ontologies. This approach can be useful for preparing terminological resources for very specific subdomains for which no relevant terminologies already exist. The evaluation performed showed that none of the tested ranking procedures were able to filter out all improperly constructed noun phrases from the top of the list. Careful choice of noun phrases is crucial to the usefulness of the created terminological resource in applications such as lexicon construction or acquisition of semantic relations from texts. BioMed Central 2014-05-31 /pmc/articles/PMC4062289/ /pubmed/24976943 http://dx.doi.org/10.1186/2041-1480-5-24 Text en Copyright © 2014 Marciniak and Mykowiecka; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. |
spellingShingle | Research Marciniak, Małgorzata Mykowiecka, Agnieszka Terminology extraction from medical texts in Polish |
title | Terminology extraction from medical texts in Polish |
title_full | Terminology extraction from medical texts in Polish |
title_fullStr | Terminology extraction from medical texts in Polish |
title_full_unstemmed | Terminology extraction from medical texts in Polish |
title_short | Terminology extraction from medical texts in Polish |
title_sort | terminology extraction from medical texts in polish |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062289/ https://www.ncbi.nlm.nih.gov/pubmed/24976943 http://dx.doi.org/10.1186/2041-1480-5-24 |
work_keys_str_mv | AT marciniakmałgorzata terminologyextractionfrommedicaltextsinpolish AT mykowieckaagnieszka terminologyextractionfrommedicaltextsinpolish |