Customization scenarios for de-identification of clinical notes

BACKGROUND: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance...

Bibliographic Details
Main Authors: Hartman, Tzvika, Howell, Michael D., Dean, Jeff, Hoory, Shlomo, Slyper, Ronit, Laish, Itay, Gilon, Oren, Vainstein, Danny, Corrado, Greg, Chou, Katherine, Po, Ming Jack, Williams, Jutta, Ellis, Scott, Bee, Gavin, Hassidim, Avinatan, Amira, Rony, Beryozkin, Genady, Szpektor, Idan, Matias, Yossi
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6993314/
https://www.ncbi.nlm.nih.gov/pubmed/32000770
http://dx.doi.org/10.1186/s12911-020-1026-2
author Hartman, Tzvika
Howell, Michael D.
Dean, Jeff
Hoory, Shlomo
Slyper, Ronit
Laish, Itay
Gilon, Oren
Vainstein, Danny
Corrado, Greg
Chou, Katherine
Po, Ming Jack
Williams, Jutta
Ellis, Scott
Bee, Gavin
Hassidim, Avinatan
Amira, Rony
Beryozkin, Genady
Szpektor, Idan
Matias, Yossi
author_sort Hartman, Tzvika
collection PubMed
description BACKGROUND: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. OBJECTIVE: We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. METHODS: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. RESULTS: Fully customized systems remove 97–99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. CONCLUSION: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.
format Online
Article
Text
id pubmed-6993314
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-69933142020-02-04 Customization scenarios for de-identification of clinical notes Hartman, Tzvika Howell, Michael D. Dean, Jeff Hoory, Shlomo Slyper, Ronit Laish, Itay Gilon, Oren Vainstein, Danny Corrado, Greg Chou, Katherine Po, Ming Jack Williams, Jutta Ellis, Scott Bee, Gavin Hassidim, Avinatan Amira, Rony Beryozkin, Genady Szpektor, Idan Matias, Yossi BMC Med Inform Decis Mak Research Article BACKGROUND: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. OBJECTIVE: We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. METHODS: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. RESULTS: Fully customized systems remove 97–99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. CONCLUSION: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level. BioMed Central 2020-01-30 /pmc/articles/PMC6993314/ /pubmed/32000770 http://dx.doi.org/10.1186/s12911-020-1026-2 Text en © The Author(s). 
2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Customization scenarios for de-identification of clinical notes
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6993314/
https://www.ncbi.nlm.nih.gov/pubmed/32000770
http://dx.doi.org/10.1186/s12911-020-1026-2