Building a best-in-class automated de-identification tool for electronic health records through ensemble learning
The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an...
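Main authors: | Murugadoss, Karthik; Rajasekharan, Ajit; Malin, Bradley; Agarwal, Vineet; Bade, Sairam; Anderson, Jeff R.; Ross, Jason L.; Faubion, William A.; Halamka, John D.; Soundararajan, Venky; Ardhanari, Sankar |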
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212138/ https://www.ncbi.nlm.nih.gov/pubmed/34179842 http://dx.doi.org/10.1016/j.patter.2021.100255 |
_version_ | 1783709611700781056 |
---|---|
author | Murugadoss, Karthik; Rajasekharan, Ajit; Malin, Bradley; Agarwal, Vineet; Bade, Sairam; Anderson, Jeff R.; Ross, Jason L.; Faubion, William A.; Halamka, John D.; Soundararajan, Venky; Ardhanari, Sankar |
author_facet | Murugadoss, Karthik; Rajasekharan, Ajit; Malin, Bradley; Agarwal, Vineet; Bade, Sairam; Anderson, Jeff R.; Ross, Jason L.; Faubion, William A.; Halamka, John D.; Soundararajan, Venky; Ardhanari, Sankar |
author_sort | Murugadoss, Karthik |
collection | PubMed |
description | The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries. |
format | Online Article Text |
id | pubmed-8212138 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-8212138 2021-06-25 Building a best-in-class automated de-identification tool for electronic health records through ensemble learning Murugadoss, Karthik Rajasekharan, Ajit Malin, Bradley Agarwal, Vineet Bade, Sairam Anderson, Jeff R. Ross, Jason L. Faubion, William A. Halamka, John D. Soundararajan, Venky Ardhanari, Sankar Patterns (N Y) Article The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries. Elsevier 2021-05-12 /pmc/articles/PMC8212138/ /pubmed/34179842 http://dx.doi.org/10.1016/j.patter.2021.100255 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Article Murugadoss, Karthik Rajasekharan, Ajit Malin, Bradley Agarwal, Vineet Bade, Sairam Anderson, Jeff R. Ross, Jason L. Faubion, William A. Halamka, John D. Soundararajan, Venky Ardhanari, Sankar Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title | Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title_full | Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title_fullStr | Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title_full_unstemmed | Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title_short | Building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
title_sort | building a best-in-class automated de-identification tool for electronic health records through ensemble learning |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212138/ https://www.ncbi.nlm.nih.gov/pubmed/34179842 http://dx.doi.org/10.1016/j.patter.2021.100255 |