
Rewriting and suppressing UMLS terms for improved biomedical term identification

BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine t...

Bibliographic Details
Main Authors: Hettne, Kristina M, van Mulligen, Erik M, Schuemie, Martijn J, Schijvenaars, Bob JA, Kors, Jan A
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895736/
https://www.ncbi.nlm.nih.gov/pubmed/20618981
http://dx.doi.org/10.1186/2041-1480-1-5
author Hettne, Kristina M
van Mulligen, Erik M
Schuemie, Martijn J
Schijvenaars, Bob JA
Kors, Jan A
author_sort Hettne, Kristina M
collection PubMed
description BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. RESULTS: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus. CONCLUSIONS: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
format Text
id pubmed-2895736
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2895736 2010-07-06 Rewriting and suppressing UMLS terms for improved biomedical term identification Hettne, Kristina M van Mulligen, Erik M Schuemie, Martijn J Schijvenaars, Bob JA Kors, Jan A J Biomed Semantics Research BioMed Central 2010-03-31 /pmc/articles/PMC2895736/ /pubmed/20618981 http://dx.doi.org/10.1186/2041-1480-1-5 Text en Copyright ©2010 Hettne et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Rewriting and suppressing UMLS terms for improved biomedical term identification
topic Research