A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies
BACKGROUND: Large biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms to describe the data and to support data linking between databases. The largest curated listing of biomedical terms is the Na...
Main author: | Berman, Jules J |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2003 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC165595/ https://www.ncbi.nlm.nih.gov/pubmed/12809560 http://dx.doi.org/10.1186/1472-6947-3-6 |
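
The extraction step this record describes (pulling only the freely shareable "Category 0" vocabularies out of a larger term listing) can be illustrated with a minimal Python sketch. The article's own Perl script is not reproduced in this record, so everything below is an assumption made for illustration: the four-field pipe-delimited layout, the names `TermRow`, `parse_rows`, and `category0_terms`, and the sample rows are all hypothetical and are not the actual UMLS file layout.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TermRow(NamedTuple):
    """One row of a hypothetical term listing (not the real UMLS layout)."""
    concept_id: str
    term: str
    source: str
    restriction_level: int


def parse_rows(lines: Iterable[str]) -> Iterator[TermRow]:
    """Parse 'concept|term|source|level' lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        concept_id, term, source, level = line.split("|")
        yield TermRow(concept_id, term, source, int(level))


def category0_terms(rows: Iterable[TermRow]) -> List[TermRow]:
    """Keep only rows whose source vocabulary has restriction level 0."""
    return [row for row in rows if row.restriction_level == 0]


# Hypothetical sample rows, invented purely for illustration.
SAMPLE = [
    "C0000001|example term a|VOCAB_OPEN|0",
    "C0000002|example term b|VOCAB_RESTRICTED|3",
    "C0000001|example term a variant|VOCAB_OPEN|0",
]

kept = category0_terms(parse_rows(SAMPLE))
for row in kept:
    print(row.concept_id, row.term, row.source)
```

Filtering on a per-row restriction level keeps the sketch independent of any particular list of vocabulary names; a real extraction would need the field positions and restriction codes from the UMLS documentation for the release in use.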
_version_ | 1782120842505748480 |
---|---|
author | Berman, Jules J |
author_facet | Berman, Jules J |
author_sort | Berman, Jules J |
collection | PubMed |
description | BACKGROUND: Large biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms to describe the data and to support data linking between databases. The largest curated listing of biomedical terms is the National Library of Medicine's Unified Medical Language System (UMLS). The UMLS contains more than 2 million biomedical terms collected from nearly 100 medical vocabularies. Many of the vocabularies contained in the UMLS carry restrictions on their use, making it impossible to share or distribute UMLS-annotated research data. However, a subset of the UMLS vocabularies, designated Category 0 by UMLS, can be used to annotate and share data sets without violating the UMLS License Agreement. METHODS: The UMLS Category 0 vocabularies can be extracted from the parent UMLS Metathesaurus using a Perl script supplied with this article. There are 43 Category 0 vocabularies that can be used freely for research purposes without violating the UMLS License Agreement. Among the Category 0 vocabularies are: MESH (Medical Subject Headings), NCBI (National Center for Biotechnology Information) Taxonomy and ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification). RESULTS: The extraction file containing all Category 0 terms and concepts is 72,581,138 bytes in length and contains 1,029,161 terms. The UMLS Metathesaurus MRCON file (January, 2003) is 151,048,493 bytes in length and contains 2,146,899 terms. Therefore the Category 0 vocabularies, in aggregate, are about half the size of the UMLS Metathesaurus. A large publicly available listing of 567,921 different medical phrases was automatically coded using the full UMLS Metathesaurus and the Category 0 vocabularies. There were 545,321 phrases with one or more matches against UMLS terms, while 468,785 phrases had one or more matches against the Category 0 terms. This indicates that when the two vocabularies are evaluated by their fitness to find at least one term for a medical phrase, the Category 0 vocabularies performed 86% as well as the complete UMLS Metathesaurus. CONCLUSION: The Category 0 vocabularies of UMLS constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data. These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement. These vocabularies may be of particular importance for sharing heterogeneous data from diverse biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article. |
format | Text |
id | pubmed-165595 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2003 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-165595 2003-07-20 A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies Berman, Jules J BMC Med Inform Decis Mak Technical Advance BioMed Central 2003-06-16 /pmc/articles/PMC165595/ /pubmed/12809560 http://dx.doi.org/10.1186/1472-6947-3-6 Text en Copyright © 2003 Berman; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. |
spellingShingle | Technical Advance Berman, Jules J A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title | A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title_full | A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title_fullStr | A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title_full_unstemmed | A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title_short | A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies |
title_sort | tool for sharing annotated research data: the "category 0" umls (unified medical language system) vocabularies |
topic | Technical Advance |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC165595/ https://www.ncbi.nlm.nih.gov/pubmed/12809560 http://dx.doi.org/10.1186/1472-6947-3-6 |
work_keys_str_mv | AT bermanjulesj atoolforsharingannotatedresearchdatathecategory0umlsunifiedmedicallanguagesystemvocabularies AT bermanjulesj toolforsharingannotatedresearchdatathecategory0umlsunifiedmedicallanguagesystemvocabularies |
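
The size and coverage comparisons reported in the abstract ("about half the size" and "86% as well") can be checked with a short calculation. The sketch below uses only the figures quoted in the description field; the constant and variable names are chosen here for illustration and do not come from the article.

```python
# Figures exactly as reported in the abstract.
CATEGORY0_BYTES = 72_581_138
MRCON_BYTES = 151_048_493
CATEGORY0_TERMS = 1_029_161
MRCON_TERMS = 2_146_899
PHRASES_TOTAL = 567_921
PHRASES_MATCHED_UMLS = 545_321
PHRASES_MATCHED_CATEGORY0 = 468_785

# Category 0 extraction relative to the full MRCON file.
size_ratio = CATEGORY0_BYTES / MRCON_BYTES
term_ratio = CATEGORY0_TERMS / MRCON_TERMS

# Share of the phrase list that received at least one match.
umls_coverage = PHRASES_MATCHED_UMLS / PHRASES_TOTAL
category0_coverage = PHRASES_MATCHED_CATEGORY0 / PHRASES_TOTAL

# Category 0 performance relative to the complete Metathesaurus.
relative_fitness = PHRASES_MATCHED_CATEGORY0 / PHRASES_MATCHED_UMLS

print(f"size ratio:         {size_ratio:.1%}")
print(f"term ratio:         {term_ratio:.1%}")
print(f"UMLS coverage:      {umls_coverage:.1%}")
print(f"Category 0 coverage:{category0_coverage:.1%}")
print(f"relative fitness:   {relative_fitness:.1%}")
```

Note that the 86% figure is the ratio of the two match counts (468,785 to 545,321), not the share of all 567,921 phrases that Category 0 matched, which is somewhat lower.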