The OpenDeID corpus for patient de-identification

For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic h...

Bibliographic Details
Main Authors: Jonnagaddala, Jitendra; Chen, Aipeng; Batongbacal, Sean; Nekkantti, Chandini
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8497517/
https://www.ncbi.nlm.nih.gov/pubmed/34620985
http://dx.doi.org/10.1038/s41598-021-99554-9
author Jonnagaddala, Jitendra
Chen, Aipeng
Batongbacal, Sean
Nekkantti, Chandini
collection PubMed
description For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotations, parallel annotations, and pre-annotations. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotation approach is not reliable in terms of quality when compared to serial annotation but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
format Online
Article
Text
id pubmed-8497517
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8497517 2021-10-12 The OpenDeID corpus for patient de-identification Jonnagaddala, Jitendra Chen, Aipeng Batongbacal, Sean Nekkantti, Chandini Sci Rep Article Nature Publishing Group UK 2021-10-07 /pmc/articles/PMC8497517/ /pubmed/34620985 http://dx.doi.org/10.1038/s41598-021-99554-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title The OpenDeID corpus for patient de-identification
topic Article