Automated de-identification of free-text medical records

BACKGROUND: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before...

Bibliographic Details
Main Authors:	Neamatullah, Ishna; Douglass, Margaret M; Lehman, Li-wei H; Reisner, Andrew; Villarroel, Mauricio; Long, William J; Szolovits, Peter; Moody, George B; Mark, Roger G; Clifford, Gari D
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2526997/ https://www.ncbi.nlm.nih.gov/pubmed/18652655 http://dx.doi.org/10.1186/1472-6947-8-32

author	Neamatullah, Ishna; Douglass, Margaret M; Lehman, Li-wei H; Reisner, Andrew; Villarroel, Mauricio; Long, William J; Szolovits, Peter; Moody, George B; Mark, Roger G; Clifford, Gari D
author_sort	Neamatullah, Ishna
collection	PubMed
description	BACKGROUND: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification. METHODS: We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus. RESULTS: Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus. 
CONCLUSION: We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
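
The pattern-matching approach and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Perl package: the regular expressions, the tiny `KNOWN_NAMES` table, the `[**LABEL**]` placeholder style, and the function names `find_phi`, `scrub`, and `metrics` are all assumptions made for illustration. The formulas are the standard definitions of recall, precision, and fallout (false positive rate).

```python
import re

# Hypothetical look-up table; a real system would load large name dictionaries.
KNOWN_NAMES = {"smith", "jones"}

# Illustrative patterns only; they cover far fewer formats than a real system.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
# Heuristic: a capitalized word that follows a title is treated as a name.
TITLE_RE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)")
WORD_RE = re.compile(r"[A-Za-z]+")


def find_phi(text):
    """Return sorted (start, end, label) spans flagged as possible PHI."""
    spans = set()
    for m in DATE_RE.finditer(text):
        spans.add((m.start(), m.end(), "DATE"))
    for m in PHONE_RE.finditer(text):
        spans.add((m.start(), m.end(), "PHONE"))
    for m in TITLE_RE.finditer(text):
        spans.add((m.start(1), m.end(1), "NAME"))
    for m in WORD_RE.finditer(text):
        if m.group().lower() in KNOWN_NAMES:  # dictionary look-up
            spans.add((m.start(), m.end(), "NAME"))
    return sorted(spans)


def scrub(text):
    """Replace every flagged span with a bracketed category placeholder."""
    out, pos = [], 0
    for start, end, label in find_phi(text):
        if start < pos:  # skip spans overlapping one already replaced
            continue
        out.append(text[pos:start])
        out.append(f"[**{label}**]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


def metrics(tp, fp, fn, tn):
    """Compute recall, precision, and fallout from token-level counts."""
    return {
        "recall": tp / (tp + fn),      # share of true PHI that was found
        "precision": tp / (tp + fp),   # share of flagged items that were PHI
        "fallout": fp / (fp + tn),     # share of non-PHI wrongly flagged
    }
```

For example, `scrub("Pt seen by Dr. Smith on 3/14/2008, call 555-123-4567.")` returns `"Pt seen by Dr. [**NAME**] on [**DATE**], call [**PHONE**]."`. A system tuned for high recall, as described in the abstract, accepts extra false positives (lower precision) in exchange for missing fewer PHI instances.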
format	Text
id	pubmed-2526997
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2526997 2008-08-29 Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. Research Article. BioMed Central 2008-07-24 /pmc/articles/PMC2526997/ /pubmed/18652655 http://dx.doi.org/10.1186/1472-6947-8-32 Text en Copyright © 2008 Neamatullah et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Automated de-identification of free-text medical records
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2526997/ https://www.ncbi.nlm.nih.gov/pubmed/18652655 http://dx.doi.org/10.1186/1472-6947-8-32