The OpenDeID corpus for patient de-identification
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic h...
Main authors: | Jonnagaddala, Jitendra; Chen, Aipeng; Batongbacal, Sean; Nekkantti, Chandini |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8497517/ https://www.ncbi.nlm.nih.gov/pubmed/34620985 http://dx.doi.org/10.1038/s41598-021-99554-9 |
_version_ | 1784579971012886528 |
---|---|
author | Jonnagaddala, Jitendra Chen, Aipeng Batongbacal, Sean Nekkantti, Chandini |
author_facet | Jonnagaddala, Jitendra Chen, Aipeng Batongbacal, Sean Nekkantti, Chandini |
author_sort | Jonnagaddala, Jitendra |
collection | PubMed |
description | For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers. |
format | Online Article Text |
id | pubmed-8497517 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8497517 2021-10-12 The OpenDeID corpus for patient de-identification Jonnagaddala, Jitendra Chen, Aipeng Batongbacal, Sean Nekkantti, Chandini Sci Rep Article For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers. Nature Publishing Group UK 2021-10-07 /pmc/articles/PMC8497517/ /pubmed/34620985 http://dx.doi.org/10.1038/s41598-021-99554-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Jonnagaddala, Jitendra Chen, Aipeng Batongbacal, Sean Nekkantti, Chandini The OpenDeID corpus for patient de-identification |
title | The OpenDeID corpus for patient de-identification |
title_full | The OpenDeID corpus for patient de-identification |
title_fullStr | The OpenDeID corpus for patient de-identification |
title_full_unstemmed | The OpenDeID corpus for patient de-identification |
title_short | The OpenDeID corpus for patient de-identification |
title_sort | opendeid corpus for patient de-identification |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8497517/ https://www.ncbi.nlm.nih.gov/pubmed/34620985 http://dx.doi.org/10.1038/s41598-021-99554-9 |
work_keys_str_mv | AT jonnagaddalajitendra theopendeidcorpusforpatientdeidentification AT chenaipeng theopendeidcorpusforpatientdeidentification AT batongbacalsean theopendeidcorpusforpatientdeidentification AT nekkanttichandini theopendeidcorpusforpatientdeidentification AT jonnagaddalajitendra opendeidcorpusforpatientdeidentification AT chenaipeng opendeidcorpusforpatientdeidentification AT batongbacalsean opendeidcorpusforpatientdeidentification AT nekkanttichandini opendeidcorpusforpatientdeidentification |