Terminology extraction from medical texts in Polish

BACKGROUND: Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consisten...

Bibliographic Details
Main authors: Marciniak, Małgorzata, Mykowiecka, Agnieszka
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062289/
https://www.ncbi.nlm.nih.gov/pubmed/24976943
http://dx.doi.org/10.1186/2041-1480-5-24
author Marciniak, Małgorzata
Mykowiecka, Agnieszka
collection PubMed
description BACKGROUND: Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consistency of hospital documentation and obtaining statistical data. To perform this task we need information on the phrases we are looking for. At the moment, clinical Polish resources are sparse. The existing terminologies, such as Polish Medical Subject Headings (MeSH), do not provide sufficient coverage for clinical tasks. It would be helpful therefore if it were possible to automatically prepare, on the basis of a data sample, an initial set of terms which, after manual verification, could be used for the purpose of information extraction. RESULTS: Using a combination of linguistic and statistical methods for processing over 1200 children hospital discharge records, we obtained a list of single and multiword terms used in hospital discharge documents written in Polish. The phrases are ordered according to their presumed importance in domain texts measured by the frequency of use of a phrase and the variety of its contexts. The evaluation showed that the automatically identified phrases cover about 84% of terms in domain texts. At the top of the ranked list, only 4% out of 400 terms were incorrect while out of the final 200, 20% of expressions were either not domain related or syntactically incorrect. We also observed that 70% of the obtained terms are not included in the Polish MeSH. CONCLUSIONS: Automatic terminology extraction can give results which are of a quality high enough to be taken as a starting point for building domain related terminological dictionaries or ontologies. This approach can be useful for preparing terminological resources for very specific subdomains for which no relevant terminologies already exist. 
The evaluation performed showed that none of the tested ranking procedures were able to filter out all improperly constructed noun phrases from the top of the list. Careful choice of noun phrases is crucial to the usefulness of the created terminological resource in applications such as lexicon construction or acquisition of semantic relations from texts.
format Online
Article
Text
id pubmed-4062289
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4062289 2014-06-27 Terminology extraction from medical texts in Polish Marciniak, Małgorzata Mykowiecka, Agnieszka J Biomed Semantics Research BioMed Central 2014-05-31 /pmc/articles/PMC4062289/ /pubmed/24976943 http://dx.doi.org/10.1186/2041-1480-5-24 Text en Copyright © 2014 Marciniak and Mykowiecka; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
title Terminology extraction from medical texts in Polish
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062289/
https://www.ncbi.nlm.nih.gov/pubmed/24976943
http://dx.doi.org/10.1186/2041-1480-5-24