
A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records



Bibliographic Details
Format: Online Article Text
Language: English
Published: IEEE 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545240/
https://www.ncbi.nlm.nih.gov/pubmed/34786303
http://dx.doi.org/10.1109/ACCESS.2021.3054479
_version_ 1784589975416733696
collection PubMed
description In recent years, de-identifying privacy-sensitive information in Electronic Health Records (EHRs) has become increasingly important for enabling the sharing and publication of their content in accordance with the restrictions imposed by both national and supranational privacy authorities. In Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to this problem, significantly improving the identification of sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To address this gap, this work develops a new Italian de-identification data set, built from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, and the remaining 50 for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As an additional contribution, a stacked word representation not previously tested in the Italian clinical de-identification scenario was combined with the best-performing Bi-LSTM + CRF sequence labeling architecture. The representation is based both on a contextualized language model, to handle word polysemy and morpho-syntactic variation, and on sub-word embeddings, to better capture latent syntactic and semantic similarities. Finally, the proposed model was compared with other cutting-edge approaches and achieved the best performance, supporting the effectiveness of the proposed approach.
format Online
Article
Text
id pubmed-8545240
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher IEEE
record_format MEDLINE/PubMed
spelling pubmed-8545240 2021-11-12 A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records IEEE Access Computational and Artificial Intelligence
IEEE 2021-01-25 /pmc/articles/PMC8545240/ /pubmed/34786303 http://dx.doi.org/10.1109/ACCESS.2021.3054479 Text en This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
spellingShingle Computational and Artificial Intelligence
A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records
title A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records
title_sort novel covid-19 data set and an effective deep learning approach for the de-identification of italian medical records
topic Computational and Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545240/
https://www.ncbi.nlm.nih.gov/pubmed/34786303
http://dx.doi.org/10.1109/ACCESS.2021.3054479
work_keys_str_mv AT anovelcovid19datasetandaneffectivedeeplearningapproachforthedeidentificationofitalianmedicalrecords
AT novelcovid19datasetandaneffectivedeeplearningapproachforthedeidentificationofitalianmedicalrecords