
The freetext matching algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records

BACKGROUND: Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text c...


Bibliographic Details
Main Authors: Shah, Anoop D; Martinez, Carlos; Hemingway, Harry
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483188/
https://www.ncbi.nlm.nih.gov/pubmed/22870911
http://dx.doi.org/10.1186/1472-6947-12-88
author Shah, Anoop D
Martinez, Carlos
Hemingway, Harry
collection PubMed
description BACKGROUND: Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records.
METHODS: We reviewed the electronic patient records in GPRD of a random sample of 3310 patients who died in 2001, to identify the cause of death. We developed a computer program called the Freetext Matching Algorithm (FMA) to map diagnoses in text to the Read Clinical Terminology. The program uses lookup tables of synonyms and phrase patterns to identify diagnoses, dates and selected test results. We tested it on two random samples of free text from GPRD (1000 texts associated with death in 2001, and 1000 general texts from cases and controls in a coronary artery disease study), comparing the output to the U.S. National Library of Medicine’s MetaMap program and the gold standard of manual review.
RESULTS: Among 3310 patients registered in the GPRD who died in 2001, the cause of death was recorded in coded form in 38.1% of patients, and in the free text alone in 19.4%. On the 1000 texts associated with death, FMA coded 683 of the 735 positive diagnoses, with precision (positive predictive value) 98.4% (95% confidence interval (CI) 97.2, 99.2) and recall (sensitivity) 92.9% (95% CI 90.8, 94.7). On the general sample, FMA detected 346 of the 447 positive diagnoses, with precision 91.5% (95% CI 88.3, 94.1) and recall 77.4% (95% CI 73.2, 81.2), which was similar to MetaMap.
CONCLUSIONS: We have developed an algorithm to extract coded information from free text in GP records with good precision. It may facilitate research using free text in electronic patient records, particularly for extracting the cause of death.
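As a quick arithmetic check on the RESULTS figures, the reported recall values follow directly from the counts given in the abstract. The sketch below is illustrative only and is not part of the FMA program; the function name `recall` and the `samples` dictionary are assumptions introduced here. Precision is not recomputed because the abstract does not report the number of false positives.

```python
def recall(detected: int, positives: int) -> float:
    """Recall (sensitivity): share of gold-standard positive diagnoses that were detected."""
    return detected / positives


# Counts taken from the abstract: (diagnoses detected by FMA, positive diagnoses on manual review)
samples = {
    "texts associated with death": (683, 735),
    "general texts": (346, 447),
}

for name, (detected, positives) in samples.items():
    print(f"{name}: recall = {100 * recall(detected, positives):.1f}%")
# prints:
# texts associated with death: recall = 92.9%
# general texts: recall = 77.4%
```

Both values agree with the recall percentages quoted in the abstract.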
format Online
Article
Text
id pubmed-3483188
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3483188 2012-10-30 BMC Med Inform Decis Mak Research Article BioMed Central 2012-08-07 /pmc/articles/PMC3483188/ /pubmed/22870911 http://dx.doi.org/10.1186/1472-6947-12-88 Text en
Copyright ©2012 Shah et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The freetext matching algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483188/
https://www.ncbi.nlm.nih.gov/pubmed/22870911
http://dx.doi.org/10.1186/1472-6947-12-88