Development and evaluation of an open source software tool for deidentification of pathology reports
BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few programs freely available to remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insuranc...
Main authors: | Beckwith, Bruce A; Mahaadevan, Rajeshwarri; Balis, Ulysses J; Kuo, Frank |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1421388/ https://www.ncbi.nlm.nih.gov/pubmed/16515714 http://dx.doi.org/10.1186/1472-6947-6-12 |
_version_ | 1782127163150958592 |
---|---|
author | Beckwith, Bruce A Mahaadevan, Rajeshwarri Balis, Ulysses J Kuo, Frank |
author_facet | Beckwith, Bruce A Mahaadevan, Rajeshwarri Balis, Ulysses J Kuo, Frank |
author_sort | Beckwith, Bruce A |
collection | PubMed |
description | BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few programs freely available to remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant, deidentification tool tailored for pathology reports. We designed a three-step process for removing potential identifiers. The first step is to look for identifiers known to be associated with the patient, such as name, medical record number, pathology accession number, etc. Next, a series of pattern matches looks for predictable patterns likely to represent identifying data, such as dates, accession numbers and addresses as well as patient, institution and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance. 1800 new pathology reports were then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed. RESULTS: 1254 (69.7%) of 1800 pathology reports contained identifiers in the body of the report. 3439 (98.3%) of 3499 unique identifiers in the test set were removed. Only 19 HIPAA-specified identifiers (mainly consult accession numbers and misspelled names) were missed. Of 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation case reports typically contain numerous identifiers and were the most challenging to deidentify comprehensively. There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with our tool. CONCLUSION: We have demonstrated that it is possible to create an open-source deidentification program which performs well on free-text pathology reports. |
format | Text |
id | pubmed-1421388 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1421388 2006-04-01 Development and evaluation of an open source software tool for deidentification of pathology reports Beckwith, Bruce A Mahaadevan, Rajeshwarri Balis, Ulysses J Kuo, Frank BMC Med Inform Decis Mak Software BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few programs freely available to remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant, deidentification tool tailored for pathology reports. We designed a three-step process for removing potential identifiers. The first step is to look for identifiers known to be associated with the patient, such as name, medical record number, pathology accession number, etc. Next, a series of pattern matches looks for predictable patterns likely to represent identifying data, such as dates, accession numbers and addresses as well as patient, institution and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance. 1800 new pathology reports were then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed. RESULTS: 1254 (69.7%) of 1800 pathology reports contained identifiers in the body of the report. 3439 (98.3%) of 3499 unique identifiers in the test set were removed. Only 19 HIPAA-specified identifiers (mainly consult accession numbers and misspelled names) were missed. Of 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation case reports typically contain numerous identifiers and were the most challenging to deidentify comprehensively. There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with our tool. CONCLUSION: We have demonstrated that it is possible to create an open-source deidentification program which performs well on free-text pathology reports. BioMed Central 2006-03-06 /pmc/articles/PMC1421388/ /pubmed/16515714 http://dx.doi.org/10.1186/1472-6947-6-12 Text en Copyright © 2006 Beckwith et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software Beckwith, Bruce A Mahaadevan, Rajeshwarri Balis, Ulysses J Kuo, Frank Development and evaluation of an open source software tool for deidentification of pathology reports |
title | Development and evaluation of an open source software tool for deidentification of pathology reports |
title_full | Development and evaluation of an open source software tool for deidentification of pathology reports |
title_fullStr | Development and evaluation of an open source software tool for deidentification of pathology reports |
title_full_unstemmed | Development and evaluation of an open source software tool for deidentification of pathology reports |
title_short | Development and evaluation of an open source software tool for deidentification of pathology reports |
title_sort | development and evaluation of an open source software tool for deidentification of pathology reports |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1421388/ https://www.ncbi.nlm.nih.gov/pubmed/16515714 http://dx.doi.org/10.1186/1472-6947-6-12 |
work_keys_str_mv | AT beckwithbrucea developmentandevaluationofanopensourcesoftwaretoolfordeidentificationofpathologyreports AT mahaadevanrajeshwarri developmentandevaluationofanopensourcesoftwaretoolfordeidentificationofpathologyreports AT balisulyssesj developmentandevaluationofanopensourcesoftwaretoolfordeidentificationofpathologyreports AT kuofrank developmentandevaluationofanopensourcesoftwaretoolfordeidentificationofpathologyreports |
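
The three-step process summarized in the description field (known patient identifiers, then pattern matches, then a lookup of individual words against a list of proper names and places) can be illustrated with a minimal Python sketch. This is not the authors' tool: the function name `deidentify`, the `[REMOVED]` placeholder, the three regular expressions, and the two-word lexicon are all hypothetical stand-ins chosen for illustration, and a real deployment would need site-specific rules, as the abstract itself notes.

```python
import re
from typing import Iterable

PLACEHOLDER = "[REMOVED]"  # hypothetical replacement token, not from the paper

# Step 2 (assumed examples): patterns likely to represent identifying data.
PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),     # dates such as 3/6/2006
    re.compile(r"\b[A-Z]{1,3}-?\d{2}-\d{3,6}\b"),   # accession-like numbers such as S06-12345
    re.compile(r"\bDr\.?\s+[A-Z][a-z]+\b"),         # physician names introduced by "Dr."
]

# Step 3 (assumed example): tiny stand-in for a database of names and places.
NAME_LEXICON = {"smith", "boston"}


def deidentify(text: str, known: Iterable[str]) -> str:
    """Apply the three steps in order and return the scrubbed text."""
    # Step 1: remove identifiers already known to belong to the patient.
    # Longer values go first so "John Doe" is removed before "Doe" alone.
    for value in sorted(known, key=len, reverse=True):
        if value:
            text = re.sub(re.escape(value), PLACEHOLDER, text, flags=re.IGNORECASE)

    # Step 2: remove anything matching a predictable identifier pattern.
    for pattern in PATTERNS:
        text = pattern.sub(PLACEHOLDER, text)

    # Step 3: compare each remaining word with the name/place lexicon.
    def scrub(match: re.Match) -> str:
        word = match.group(0)
        return PLACEHOLDER if word.lower() in NAME_LEXICON else word

    return re.sub(r"[A-Za-z]+", scrub, text)


if __name__ == "__main__":
    report = ("John Doe (MRN 1234567), seen 3/6/2006 by Dr. Smith in Boston, "
              "case S06-12345: invasive adenocarcinoma.")
    print(deidentify(report, known=["John Doe", "1234567"]))
```

Running the steps in this order means the most reliable information (identifiers known for the patient) is used first, and the broad word-by-word lookup only sees what the earlier, more specific steps left behind. The diagnostic text itself is left intact.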