
Improved de-identification of physician notes through integrative modeling of both public and private medical text


Bibliographic Details
Main Authors:	McMurry, Andrew J; Fitch, Britt; Savova, Guergana; Kohane, Isaac S; Reis, Ben Y
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3907029/ https://www.ncbi.nlm.nih.gov/pubmed/24083569 http://dx.doi.org/10.1186/1472-6947-13-112

author	McMurry, Andrew J; Fitch, Britt; Savova, Guergana; Kohane, Isaac S; Reis, Ben Y
collection	PubMed
description	BACKGROUND: Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts. METHODS: Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part-of-speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were included from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers. RESULTS: The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost-sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word “of” appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as “elevated white blood cell count” were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards. CONCLUSIONS: The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.
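The abstract reports both an F10 and an F1 score. As a minimal sketch of why the two differ so much, the snippet below computes the standard F-beta score, which weights recall beta times as heavily as precision. The recall value is the 98% stated in the abstract; the precision value is a hypothetical number chosen only for illustration and is not a figure reported in this record.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favors recall, beta < 1 favors precision, beta == 1 gives F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Recall comes from the abstract (98% of PHI tokens recalled).
recall = 0.98
# ASSUMPTION: illustrative precision only, not a value from the paper.
precision = 0.62

f1 = f_beta(precision, recall, beta=1.0)
f10 = f_beta(precision, recall, beta=10.0)

print(f"F1  = {f1:.3f}")
print(f"F10 = {f10:.3f}")
```

With high recall and moderate precision, F10 stays close to the recall value while F1 is pulled down by precision, which matches the pattern of a high F10 alongside a much lower F1.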
format	Online Article Text
id	pubmed-3907029
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3907029 2014-02-12 BMC Med Inform Decis Mak Research Article BioMed Central 2013-10-02 /pmc/articles/PMC3907029/ /pubmed/24083569 http://dx.doi.org/10.1186/1472-6947-13-112 Text en Copyright © 2013 McMurry et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Improved de-identification of physician notes through integrative modeling of both public and private medical text
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3907029/ https://www.ncbi.nlm.nih.gov/pubmed/24083569 http://dx.doi.org/10.1186/1472-6947-13-112
