Building a best-in-class automated de-identification tool for electronic health records through ensemble learning

The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an...

Bibliographic Details
Main Authors: Murugadoss, Karthik, Rajasekharan, Ajit, Malin, Bradley, Agarwal, Vineet, Bade, Sairam, Anderson, Jeff R., Ross, Jason L., Faubion, William A., Halamka, John D., Soundararajan, Venky, Ardhanari, Sankar
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212138/
https://www.ncbi.nlm.nih.gov/pubmed/34179842
http://dx.doi.org/10.1016/j.patter.2021.100255
_version_ 1783709611700781056
author Murugadoss, Karthik
Rajasekharan, Ajit
Malin, Bradley
Agarwal, Vineet
Bade, Sairam
Anderson, Jeff R.
Ross, Jason L.
Faubion, William A.
Halamka, John D.
Soundararajan, Venky
Ardhanari, Sankar
author_sort Murugadoss, Karthik
collection PubMed
description The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
format Online
Article
Text
id pubmed-8212138
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8212138 2021-06-25 Building a best-in-class automated de-identification tool for electronic health records through ensemble learning Murugadoss, Karthik Rajasekharan, Ajit Malin, Bradley Agarwal, Vineet Bade, Sairam Anderson, Jeff R. Ross, Jason L. Faubion, William A. Halamka, John D. Soundararajan, Venky Ardhanari, Sankar Patterns (N Y) Article The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries. Elsevier 2021-05-12 /pmc/articles/PMC8212138/ /pubmed/34179842 http://dx.doi.org/10.1016/j.patter.2021.100255 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Building a best-in-class automated de-identification tool for electronic health records through ensemble learning
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212138/
https://www.ncbi.nlm.nih.gov/pubmed/34179842
http://dx.doi.org/10.1016/j.patter.2021.100255