Large-scale evaluation of automated clinical note de-identification and its impact on information extraction

OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MA...

Bibliographic Details
Main authors: Deleger, Louise; Molnar, Katalin; Savova, Guergana; Xia, Fei; Lingren, Todd; Li, Qi; Marsolo, Keith; Jegga, Anil; Kaiser, Megan; Stoutenborough, Laura; Solti, Imre
Format: Online Article Text
Language: English
Published: BMJ Group 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3555323/
https://www.ncbi.nlm.nih.gov/pubmed/22859645
http://dx.doi.org/10.1136/amiajnl-2012-001012
_version_ 1782257019812577280
author Deleger, Louise
Molnar, Katalin
Savova, Guergana
Xia, Fei
Lingren, Todd
Li, Qi
Marsolo, Keith
Jegga, Anil
Kaiser, Megan
Stoutenborough, Laura
Solti, Imre
author_sort Deleger, Louise
collection PubMed
description OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MATERIAL AND METHODS: A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. The sensitivity, precision, and F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated ‘gold standard’. Statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured. RESULTS: The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (annotators' performance was 92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall, while the best system obtained 92.91%(R)/95.73%(P) on the same text). The impact of automated de-identification was minimal on the utility of the narrative notes for subsequent information extraction as measured by the sensitivity and precision of medication name extraction. DISCUSSION AND CONCLUSION: NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively.
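The abstract reports sensitivity (R) and precision (P) but not the resulting F values. As a rough illustration, the sketch below combines the reported pairs, assuming that "F value" means the balanced F1 score (the harmonic mean of P and R); the abstract does not state which weighting was used. The function name `f_value` and the row labels are hypothetical, chosen only for this example.

```python
def f_value(sensitivity: float, precision: float, beta: float = 1.0) -> float:
    """F-beta score from sensitivity (recall) and precision on the same scale.

    Assumption: beta = 1 (balanced F1); the abstract does not specify the weighting.
    """
    if sensitivity == 0 and precision == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)


# (sensitivity %, precision %) pairs copied from the abstract; labels are illustrative.
reported = {
    "best NLP system, full gold standard": (91.92, 95.08),
    "first human annotator, 10% subsample": (92.15, 93.95),
    "second human annotator, 10% subsample": (94.55, 88.45),
    "best NLP system, 10% subsample": (92.91, 95.73),
}

for label, (r, p) in reported.items():
    print(f"{label}: F = {f_value(r, p):.2f}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, a system with a wide gap between R and P is penalized more than one with balanced scores.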
format Online
Article
Text
id pubmed-3555323
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BMJ Group
record_format MEDLINE/PubMed
spelling pubmed-3555323 2013-12-14 Large-scale evaluation of automated clinical note de-identification and its impact on information extraction Deleger, Louise Molnar, Katalin Savova, Guergana Xia, Fei Lingren, Todd Li, Qi Marsolo, Keith Jegga, Anil Kaiser, Megan Stoutenborough, Laura Solti, Imre J Am Med Inform Assoc Focus on Patient Privacy OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MATERIAL AND METHODS: A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. The sensitivity, precision, and F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated ‘gold standard’. Statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured. RESULTS: The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (annotators' performance was 92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall, while the best system obtained 92.91%(R)/95.73%(P) on the same text). The impact of automated de-identification was minimal on the utility of the narrative notes for subsequent information extraction as measured by the sensitivity and precision of medication name extraction. DISCUSSION AND CONCLUSION: NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively. BMJ Group 2013 /pmc/articles/PMC3555323/ /pubmed/22859645 http://dx.doi.org/10.1136/amiajnl-2012-001012 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/3.0/ and http://creativecommons.org/licenses/by-nc/3.0/legalcode
title Large-scale evaluation of automated clinical note de-identification and its impact on information extraction
topic Focus on Patient Privacy
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3555323/
https://www.ncbi.nlm.nih.gov/pubmed/22859645
http://dx.doi.org/10.1136/amiajnl-2012-001012