Rewriting and suppressing UMLS terms for improved biomedical term identification
BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine t...
Main authors: | Hettne, Kristina M; van Mulligen, Erik M; Schuemie, Martijn J; Schijvenaars, Bob JA; Kors, Jan A |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895736/ https://www.ncbi.nlm.nih.gov/pubmed/20618981 http://dx.doi.org/10.1186/2041-1480-1-5 |
_version_ | 1782183286529851392 |
---|---|
author | Hettne, Kristina M van Mulligen, Erik M Schuemie, Martijn J Schijvenaars, Bob JA Kors, Jan A |
author_facet | Hettne, Kristina M van Mulligen, Erik M Schuemie, Martijn J Schijvenaars, Bob JA Kors, Jan A |
author_sort | Hettne, Kristina M |
collection | PubMed |
description | BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. RESULTS: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus. CONCLUSIONS: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper. |
format | Text |
id | pubmed-2895736 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2895736 2010-07-06 Rewriting and suppressing UMLS terms for improved biomedical term identification Hettne, Kristina M van Mulligen, Erik M Schuemie, Martijn J Schijvenaars, Bob JA Kors, Jan A J Biomed Semantics Research BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. RESULTS: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus. CONCLUSIONS: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper. BioMed Central 2010-03-31 /pmc/articles/PMC2895736/ /pubmed/20618981 http://dx.doi.org/10.1186/2041-1480-1-5 Text en Copyright ©2010 Hettne et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Hettne, Kristina M van Mulligen, Erik M Schuemie, Martijn J Schijvenaars, Bob JA Kors, Jan A Rewriting and suppressing UMLS terms for improved biomedical term identification |
title | Rewriting and suppressing UMLS terms for improved biomedical term identification |
title_full | Rewriting and suppressing UMLS terms for improved biomedical term identification |
title_fullStr | Rewriting and suppressing UMLS terms for improved biomedical term identification |
title_full_unstemmed | Rewriting and suppressing UMLS terms for improved biomedical term identification |
title_short | Rewriting and suppressing UMLS terms for improved biomedical term identification |
title_sort | rewriting and suppressing umls terms for improved biomedical term identification |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895736/ https://www.ncbi.nlm.nih.gov/pubmed/20618981 http://dx.doi.org/10.1186/2041-1480-1-5 |
work_keys_str_mv | AT hettnekristinam rewritingandsuppressingumlstermsforimprovedbiomedicaltermidentification AT vanmulligenerikm rewritingandsuppressingumlstermsforimprovedbiomedicaltermidentification AT schuemiemartijnj rewritingandsuppressingumlstermsforimprovedbiomedicaltermidentification AT schijvenaarsbobja rewritingandsuppressingumlstermsforimprovedbiomedicaltermidentification AT korsjana rewritingandsuppressingumlstermsforimprovedbiomedicaltermidentification |