
Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis

OBJECTIVE: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. DESIGN: Based on the occurrences of UMLS terms in a 51 million d...


Bibliographic Details
Main authors: Wu, Stephen T; Liu, Hongfang; Li, Dingcheng; Tao, Cui; Musen, Mark A; Chute, Christopher G; Shah, Nigam H
Format: Online Article Text
Language: English
Published: BMJ Group 2012
Subjects: Research and Applications
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3392861/
https://www.ncbi.nlm.nih.gov/pubmed/22493050
http://dx.doi.org/10.1136/amiajnl-2011-000744
author Wu, Stephen T
Liu, Hongfang
Li, Dingcheng
Tao, Cui
Musen, Mark A
Chute, Christopher G
Shah, Nigam H
collection PubMed
description OBJECTIVE: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. DESIGN: Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. RESULTS: For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. CONCLUSION: The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
format Online
Article
Text
id pubmed-3392861
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BMJ Group
record_format MEDLINE/PubMed
spelling pubmed-3392861 2012-07-10 Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis Wu, Stephen T Liu, Hongfang Li, Dingcheng Tao, Cui Musen, Mark A Chute, Christopher G Shah, Nigam H J Am Med Inform Assoc Research and Applications OBJECTIVE: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. DESIGN: Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. RESULTS: For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. CONCLUSION: The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain. BMJ Group 2012-04-04 2012-06 /pmc/articles/PMC3392861/ /pubmed/22493050 http://dx.doi.org/10.1136/amiajnl-2011-000744 Text en © 2012, Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.
title Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis
topic Research and Applications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3392861/
https://www.ncbi.nlm.nih.gov/pubmed/22493050
http://dx.doi.org/10.1136/amiajnl-2011-000744