Customization scenarios for de-identification of clinical notes
BACKGROUND: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance...
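Main Authors: | Hartman, Tzvika; Howell, Michael D.; Dean, Jeff; Hoory, Shlomo; Slyper, Ronit; Laish, Itay; Gilon, Oren; Vainstein, Danny; Corrado, Greg; Chou, Katherine; Po, Ming Jack; Williams, Jutta; Ellis, Scott; Bee, Gavin; Hassidim, Avinatan; Amira, Rony; Beryozkin, Genady; Szpektor, Idan; Matias, Yossi |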
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6993314/ https://www.ncbi.nlm.nih.gov/pubmed/32000770 http://dx.doi.org/10.1186/s12911-020-1026-2 |
_version_ | 1783493005071613952 |
---|---|
author | Hartman, Tzvika; Howell, Michael D.; Dean, Jeff; Hoory, Shlomo; Slyper, Ronit; Laish, Itay; Gilon, Oren; Vainstein, Danny; Corrado, Greg; Chou, Katherine; Po, Ming Jack; Williams, Jutta; Ellis, Scott; Bee, Gavin; Hassidim, Avinatan; Amira, Rony; Beryozkin, Genady; Szpektor, Idan; Matias, Yossi |
author_sort | Hartman, Tzvika |
collection | PubMed |
description | BACKGROUND: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. OBJECTIVE: We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. METHODS: We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. RESULTS: Fully customized systems remove 97–99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. CONCLUSION: Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level. |
format | Online Article Text |
id | pubmed-6993314 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6993314 2020-02-04 Customization scenarios for de-identification of clinical notes BMC Med Inform Decis Mak Research Article BioMed Central 2020-01-30 /pmc/articles/PMC6993314/ /pubmed/32000770 http://dx.doi.org/10.1186/s12911-020-1026-2 Text en © The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Customization scenarios for de-identification of clinical notes |
topic | Research Article |