
Resources for comparing the speed and performance of medical autocoders

BACKGROUND: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medi...


Bibliographic Details
Main author:	Berman, Jules J
Format:	Text
Language:	English
Published:	BioMed Central 2004
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC441395/ https://www.ncbi.nlm.nih.gov/pubmed/15198804 http://dx.doi.org/10.1186/1472-6947-4-8

_version_	1782121539059056640
author	Berman, Jules J
author_facet	Berman, Jules J
author_sort	Berman, Jules J
collection	PubMed
description	BACKGROUND: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medical concept to all equivalent representations. A concept search on the term renal cell carcinoma would be expected to find occurrences of hypernephroma and renal carcinoma (concept equivalents). The purpose of this study is to provide freely available resources to compare speed and performance among different autocoders. These tools consist of: 1) a public domain autocoder written in Perl (a free and open-source programming language that installs on any operating system); 2) a nomenclature database derived from the unencumbered subset of the publicly available Unified Medical Language System; 3) a large corpus of autocoded output derived from a publicly available medical text. METHODS: A simple lexical autocoder was written that parses plain text into a listing of all 1-, 2-, 3-, and 4-word strings contained in the text, assigning a nomenclature code to each text string that matches a term in the nomenclature. The nomenclature used is the unencumbered subset of the 2003 Unified Medical Language System (UMLS). The unencumbered subset of UMLS was reduced to exclude homonymous one-word terms and proper names, resulting in a term/code data dictionary containing about half a million medical terms. The Online Mendelian Inheritance in Man (OMIM), a 92+ Megabyte publicly available medical opus, was used as sample medical text for the autocoder. RESULTS: The autocoding Perl script is remarkably short, consisting of just 38 command lines. The 92+ Megabyte OMIM file was completely autocoded in 869 seconds on a 2.4 GHz processor (less than 10 seconds per Megabyte of text). The autocoded output file (9,540,442 bytes) contains 367,963 coded terms from OMIM and is distributed with this manuscript. CONCLUSIONS: A public domain Perl script is provided that can parse through plain-text files of any length, matching concepts against an external nomenclature. The script and associated files can be used freely to compare the speed and performance of autocoding software.
format	Text
id	pubmed-441395
institution	National Center for Biotechnology Information
language	English
publishDate	2004
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4413952004-07-02 Resources for comparing the speed and performance of medical autocoders Berman, Jules J BMC Med Inform Decis Mak Software BioMed Central 2004-06-15 /pmc/articles/PMC441395/ /pubmed/15198804 http://dx.doi.org/10.1186/1472-6947-4-8 Text en Copyright © 2004 Berman; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
spellingShingle	Software Berman, Jules J Resources for comparing the speed and performance of medical autocoders
title	Resources for comparing the speed and performance of medical autocoders
title_full	Resources for comparing the speed and performance of medical autocoders
title_fullStr	Resources for comparing the speed and performance of medical autocoders
title_full_unstemmed	Resources for comparing the speed and performance of medical autocoders
title_short	Resources for comparing the speed and performance of medical autocoders
title_sort	resources for comparing the speed and performance of medical autocoders
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC441395/ https://www.ncbi.nlm.nih.gov/pubmed/15198804 http://dx.doi.org/10.1186/1472-6947-4-8
work_keys_str_mv	AT bermanjulesj resourcesforcomparingthespeedandperformanceofmedicalautocoders
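
The METHODS summary in the description field above outlines a lexical autocoder that lists every 1-, 2-, 3-, and 4-word string in the text and assigns a code to each string found in the nomenclature. The following is a minimal Python sketch of that idea, not the article's Perl script. The function name `autocode`, the sentence-splitting rule, the tokenizer, and the toy dictionary with its placeholder codes are all assumptions made for illustration; a real run would load the term/code dictionary derived from UMLS.

```python
import re

MAX_WORDS = 4  # the abstract describes matching 1-, 2-, 3-, and 4-word strings


def autocode(text, nomenclature, max_words=MAX_WORDS):
    """Return (term, code) pairs for each 1..max_words word string found in the nomenclature."""
    matches = []
    # Assumption: phrases are not allowed to span sentence-ending punctuation.
    for sentence in re.split(r"[.;:!?]+", text.lower()):
        # Assumption: a "word" is a run of lowercase letters or digits.
        words = re.findall(r"[a-z0-9]+", sentence)
        for start in range(len(words)):
            for length in range(1, max_words + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                code = nomenclature.get(phrase)
                if code is not None:
                    matches.append((phrase, code))
    return matches


# Hypothetical toy nomenclature: the code is a placeholder, not a real UMLS identifier.
# Equivalent terms share one code, which is what makes this concept indexing.
TOY_NOMENCLATURE = {
    "renal cell carcinoma": "C-0001",
    "hypernephroma": "C-0001",
    "renal carcinoma": "C-0001",
}

sample = "Hypernephroma was found. The renal cell carcinoma was resected."
result = autocode(sample, TOY_NOMENCLATURE)
for term, code in result:
    print(code, term)
```

Because the nomenclature is held in a dictionary keyed by term, each candidate phrase costs one hash lookup, so the work grows roughly linearly with the length of the text for a fixed maximum phrase length.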
