Large-scale evaluation of automated clinical note de-identification and its impact on information extraction
OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MA...
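Main authors: | Deleger, Louise; Molnar, Katalin; Savova, Guergana; Xia, Fei; Lingren, Todd; Li, Qi; Marsolo, Keith; Jegga, Anil; Kaiser, Megan; Stoutenborough, Laura; Solti, Imre |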
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BMJ Group 2013 |
Subjects: | Focus on Patient Privacy |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3555323/ https://www.ncbi.nlm.nih.gov/pubmed/22859645 http://dx.doi.org/10.1136/amiajnl-2012-001012 |
_version_ | 1782257019812577280 |
---|---|
author | Deleger, Louise Molnar, Katalin Savova, Guergana Xia, Fei Lingren, Todd Li, Qi Marsolo, Keith Jegga, Anil Kaiser, Megan Stoutenborough, Laura Solti, Imre |
author_facet | Deleger, Louise Molnar, Katalin Savova, Guergana Xia, Fei Lingren, Todd Li, Qi Marsolo, Keith Jegga, Anil Kaiser, Megan Stoutenborough, Laura Solti, Imre |
author_sort | Deleger, Louise |
collection | PubMed |
description | OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MATERIAL AND METHODS: A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. Sensitivity, precision, F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated ‘gold standard’. Statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured. RESULTS: The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (annotators' performance was 92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall while the best system obtained 92.91%(R)/95.73%(P) on same text). The impact of automated de-identification was minimal on the utility of the narrative notes for subsequent information extraction as measured by the sensitivity and precision of medication name extraction. DISCUSSION AND CONCLUSION: NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively. |
format | Online Article Text |
id | pubmed-3555323 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BMJ Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-3555323 2013-12-14 Large-scale evaluation of automated clinical note de-identification and its impact on information extraction Deleger, Louise Molnar, Katalin Savova, Guergana Xia, Fei Lingren, Todd Li, Qi Marsolo, Keith Jegga, Anil Kaiser, Megan Stoutenborough, Laura Solti, Imre J Am Med Inform Assoc Focus on Patient Privacy BMJ Group 2013 /pmc/articles/PMC3555323/ /pubmed/22859645 http://dx.doi.org/10.1136/amiajnl-2012-001012 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/3.0/ and http://creativecommons.org/licenses/by-nc/3.0/legalcode |
spellingShingle | Focus on Patient Privacy Deleger, Louise Molnar, Katalin Savova, Guergana Xia, Fei Lingren, Todd Li, Qi Marsolo, Keith Jegga, Anil Kaiser, Megan Stoutenborough, Laura Solti, Imre Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title | Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title_full | Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title_fullStr | Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title_full_unstemmed | Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title_short | Large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
title_sort | large-scale evaluation of automated clinical note de-identification and its impact on information extraction |
topic | Focus on Patient Privacy |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3555323/ https://www.ncbi.nlm.nih.gov/pubmed/22859645 http://dx.doi.org/10.1136/amiajnl-2012-001012 |
work_keys_str_mv | AT delegerlouise largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT molnarkatalin largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT savovaguergana largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT xiafei largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT lingrentodd largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT liqi largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT marsolokeith largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT jeggaanil largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT kaisermegan largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT stoutenboroughlaura largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction AT soltiimre largescaleevaluationofautomatedclinicalnotedeidentificationanditsimpactoninformationextraction |
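
The description field reports sensitivity (R) and precision (P) for each system and annotator, and says an F value was computed, but it does not state that value. The sketch below shows how an F value could be derived from the reported pairs. It assumes the balanced F1 measure (the harmonic mean of P and R); the record does not say which weighting the authors used, so the printed numbers are illustrative and are not results from the article. The function name `f_measure` and the labels "annotator 1" and "annotator 2" are hypothetical.

```python
def f_measure(recall: float, precision: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    Inputs and output share the same unit (here: percent).
    beta = 1.0 gives the balanced F1 measure, which is an assumption:
    the catalog record does not state the weighting used in the study.
    """
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (recall, precision) pairs in percent, copied from the description field.
# The annotator numbering is a hypothetical label, not taken from the record.
reported = {
    "best system, full gold standard": (91.92, 95.08),
    "annotator 1, 10% subsample": (92.15, 93.95),
    "annotator 2, 10% subsample": (94.55, 88.45),
    "best system, 10% subsample": (92.91, 95.73),
}

for label, (r, p) in reported.items():
    print(f"{label}: F = {f_measure(r, p):.2f}%")
```

Because the harmonic mean is pulled toward the lower of its two inputs, a large gap between sensitivity and precision lowers the F value more than a simple average would.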