
PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic

Health-related information is considered ‘highly sensitive’ by the European General Data Protection Regulation (GDPR), and determining whether a text document contains health-related information is of interest to both individuals and companies in a number of scenarios. Although...


Bibliographic Details
Main Authors: Saniei, Rana; Rodríguez Doncel, Víctor
Format: Online Article Text
Language: English
Published: Springer Nature Singapore 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8983320/
https://www.ncbi.nlm.nih.gov/pubmed/35400014
http://dx.doi.org/10.1007/s42979-022-01097-x
author Saniei, Rana
Rodríguez Doncel, Víctor
collection PubMed
description Health-related information is considered ‘highly sensitive’ by the European General Data Protection Regulation (GDPR), and determining whether a text document contains health-related information is of interest to both individuals and companies in a number of scenarios. Although some efforts have been made to detect different categories of personal data in texts, including health information, automatic classification remains challenging. In this work, we aim to contribute to solving this challenge by building a corpus of tweets shared in the context of the current COVID-19 pandemic. The corpus is called PHDD (Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic) and contains 1,494 tweets that have been manually tagged by three taggers along three dimensions: health-sensitivity status, categories of health information, and subject of health history. Furthermore, a lightweight ontology called PTHI (Privacy Tags for Health Information), which reuses two other vocabularies, namely hl7 and dpv, has been built to represent the corpus in a machine-readable format. The corpus is publicly available and can be used by NLP experts to implement techniques for detecting sensitive health information in textual documents.
format Online
Article
Text
id pubmed-8983320
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer Nature Singapore
record_format MEDLINE/PubMed
spelling pubmed-8983320 2022-04-06 PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic Saniei, Rana Rodríguez Doncel, Víctor SN Comput Sci Original Research
Springer Nature Singapore 2022-04-06 2022 /pmc/articles/PMC8983320/ /pubmed/35400014 http://dx.doi.org/10.1007/s42979-022-01097-x Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic
topic Original Research