Cargando…

Resources for comparing the speed and performance of medical autocoders

BACKGROUND: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medi...

Descripción completa

Detalles Bibliográficos
Autor principal: Berman, Jules J
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2004
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC441395/
https://www.ncbi.nlm.nih.gov/pubmed/15198804
http://dx.doi.org/10.1186/1472-6947-4-8
_version_ 1782121539059056640
author Berman, Jules J
author_facet Berman, Jules J
author_sort Berman, Jules J
collection PubMed
description BACKGROUND: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medical concept to all equivalent representations. A concept search on the term renal cell carcinoma would be expected to find occurrences of hypernephroma, and renal carcinoma (concept equivalents). The purpose of this study is to provide freely available resources to compare speed and performance among different autocoders. These tools consist of: 1) a public domain autocoder written in Perl (a free and open source programming language that installs on any operating system); 2) a nomenclature database derived from the unencumbered subset of the publicly available Unified Medical Language System; 3) a large corpus of autocoded output derived from a publicly available medical text. METHODS: A simple lexical autocoder was written that parses plain-text into a listing of all 1,2,3, and 4-word strings contained in text, assigning a nomenclature code for text strings that match terms in the nomenclature. The nomenclature used is the unencumbered subset of the 2003 Unified Medical Language System (UMLS). The unencumbered subset of UMLS was reduced to exclude homonymous one-word terms and proper names, resulting in a term/code data dictionary containing about a half million medical terms. The Online Mendelian Inheritance in Man (OMIM), a 92+ Megabyte publicly available medical opus, was used as sample medical text for the autocoder. RESULTS: The autocoding Perl script is remarkably short, consisting of just 38 command lines. The 92+ Megabyte OMIM file was completely autocoded in 869 seconds on a 2.4 GHz processor (less than 10 seconds per Megabyte of text). The autocoded output file (9,540,442 bytes) contains 367,963 coded terms from OMIM and is distributed with this manuscript. CONCLUSIONS: A public domain Perl script is provided that can parse through plain-text files of any length, matching concepts against an external nomenclature. The script and associated files can be used freely to compare the speed and performance of autocoding software.
format Text
id pubmed-441395
institution National Center for Biotechnology Information
language English
publishDate 2004
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4413952004-07-02 Resources for comparing the speed and performance of medical autocoders Berman, Jules J BMC Med Inform Decis Mak Software BACKGROUND: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medical concept to all equivalent representations. A concept search on the term renal cell carcinoma would be expected to find occurrences of hypernephroma, and renal carcinoma (concept equivalents). The purpose of this study is to provide freely available resources to compare speed and performance among different autocoders. These tools consist of: 1) a public domain autocoder written in Perl (a free and open source programming language that installs on any operating system); 2) a nomenclature database derived from the unencumbered subset of the publicly available Unified Medical Language System; 3) a large corpus of autocoded output derived from a publicly available medical text. METHODS: A simple lexical autocoder was written that parses plain-text into a listing of all 1,2,3, and 4-word strings contained in text, assigning a nomenclature code for text strings that match terms in the nomenclature. The nomenclature used is the unencumbered subset of the 2003 Unified Medical Language System (UMLS). The unencumbered subset of UMLS was reduced to exclude homonymous one-word terms and proper names, resulting in a term/code data dictionary containing about a half million medical terms. The Online Mendelian Inheritance in Man (OMIM), a 92+ Megabyte publicly available medical opus, was used as sample medical text for the autocoder. RESULTS: The autocoding Perl script is remarkably short, consisting of just 38 command lines. The 92+ Megabyte OMIM file was completely autocoded in 869 seconds on a 2.4 GHz processor (less than 10 seconds per Megabyte of text). The autocoded output file (9,540,442 bytes) contains 367,963 coded terms from OMIM and is distributed with this manuscript. CONCLUSIONS: A public domain Perl script is provided that can parse through plain-text files of any length, matching concepts against an external nomenclature. The script and associated files can be used freely to compare the speed and performance of autocoding software. BioMed Central 2004-06-15 /pmc/articles/PMC441395/ /pubmed/15198804 http://dx.doi.org/10.1186/1472-6947-4-8 Text en Copyright © 2004 Berman; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
spellingShingle Software
Berman, Jules J
Resources for comparing the speed and performance of medical autocoders
title Resources for comparing the speed and performance of medical autocoders
title_full Resources for comparing the speed and performance of medical autocoders
title_fullStr Resources for comparing the speed and performance of medical autocoders
title_full_unstemmed Resources for comparing the speed and performance of medical autocoders
title_short Resources for comparing the speed and performance of medical autocoders
title_sort resources for comparing the speed and performance of medical autocoders
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC441395/
https://www.ncbi.nlm.nih.gov/pubmed/15198804
http://dx.doi.org/10.1186/1472-6947-4-8
work_keys_str_mv AT bermanjulesj resourcesforcomparingthespeedandperformanceofmedicalautocoders