Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada

BACKGROUND: The use of social media data provides an opportunity to complement traditional influenza and COVID-19 surveillance methods for the detection and control of outbreaks and informing public health interventions. OBJECTIVE: The first aim of this study is to investigate the degree to which Tw...

Bibliographic Details
Main authors: Tian, Yuan, Zhang, Wenjing, Duan, Lujie, McDonald, Wade, Osgood, Nathaniel
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338115/
https://www.ncbi.nlm.nih.gov/pubmed/37448834
http://dx.doi.org/10.3389/fdgth.2023.1203874
_version_ 1785071559278329856
author Tian, Yuan
Zhang, Wenjing
Duan, Lujie
McDonald, Wade
Osgood, Nathaniel
author_sort Tian, Yuan
collection PubMed
description BACKGROUND: Social media data offer an opportunity to complement traditional influenza and COVID-19 surveillance methods for detecting and controlling outbreaks and for informing public health interventions. OBJECTIVE: The first aim of this study is to investigate the degree to which Twitter users disclose health experiences related to influenza and COVID-19 that could be indicative of recent plausible influenza cases or symptomatic COVID-19 infections. Second, we seek to use the Twitter datasets to train and evaluate the classification performance of Bidirectional Encoder Representations from Transformers (BERT) and variant language models in the context of influenza and COVID-19 infection detection. METHODS: We constructed two Twitter datasets using a keyword-based filtering approach on English-language tweets collected from December 2016 to December 2022 in Saskatchewan, Canada. The influenza-related dataset comprised tweets filtered with influenza-related keywords from December 13, 2016, to March 17, 2018, while the COVID-19 dataset comprised tweets filtered with COVID-19 symptom-related keywords from January 1, 2020, to June 22, 2021. The Twitter datasets were cleaned, and each tweet was annotated by at least two annotators as to whether it suggested recent plausible influenza cases or symptomatic COVID-19 cases. We then assessed the classification performance of pre-trained transformer-based language models, including BERT-base, BERT-large, RoBERTa-base, RoBERTa-large, BERTweet-base, BERTweet-covid-base, BERTweet-large, and COVID-Twitter-BERT (CT-BERT) models, on each dataset. To address the notable class imbalance, we experimented with both oversampling and undersampling methods. RESULTS: The influenza dataset had 1129 out of 6444 (17.5%) tweets annotated as suggesting recent plausible influenza cases. The COVID-19 dataset had 924 out of 11939 (7.7%) tweets annotated as suggesting recent plausible COVID-19 cases. When compared against other language models on the COVID-19 dataset, CT-BERT performed the best, achieving the highest scores for recall (94.8%), F1 (94.4%), and accuracy (94.6%). For the influenza dataset, BERTweet models exhibited better performance. Our results also showed that applying data balancing techniques such as oversampling or undersampling did not lead to improved model performance. CONCLUSIONS: Utilizing domain-specific language models for monitoring users' health experiences related to influenza and COVID-19 on social media shows improved classification performance and has the potential to supplement real-time disease surveillance.
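Note: the description mentions random oversampling and undersampling for class imbalance and reports recall, F1, and accuracy. The following is only a minimal sketch of those ideas in plain Python, not the authors' code. The function names `resample` and `binary_metrics` are hypothetical, the label list is synthetic (built only from the class counts quoted for the influenza dataset), and the record does not say how the authors averaged their metrics or which data split they resampled.

```python
import random
from collections import Counter


def resample(labels, strategy, seed=0):
    """Return indices of a class-balanced sample (hypothetical helper).

    strategy="over":  duplicate minority items at random until classes match.
    strategy="under": drop majority items at random until classes match.
    """
    if strategy not in ("over", "under"):
        raise ValueError("strategy must be 'over' or 'under'")
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    sizes = [len(ix) for ix in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)
    chosen = []
    for ix in by_class.values():
        if len(ix) >= target:
            # Keep exactly `target` distinct items of this class.
            chosen.extend(rng.sample(ix, target))
        else:
            # Keep every item and add random duplicates up to `target`.
            chosen.extend(ix + rng.choices(ix, k=target - len(ix)))
    rng.shuffle(chosen)
    return chosen


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 (positive class) and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Synthetic labels using only the influenza class counts from the record:
    # 1129 positive tweets out of 6444.
    labels = [1] * 1129 + [0] * (6444 - 1129)
    for strategy in ("over", "under"):
        idx = resample(labels, strategy)
        print(strategy, Counter(labels[i] for i in idx))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a setup like this, resampling would normally be applied to the training split only, so that evaluation still reflects the original class ratio; whether the study did so is not stated in this record.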
format Online
Article
Text
id pubmed-10338115
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10338115 2023-07-13 Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada Tian, Yuan Zhang, Wenjing Duan, Lujie McDonald, Wade Osgood, Nathaniel Front Digit Health Digital Health Frontiers Media S.A. 2023-06-28 /pmc/articles/PMC10338115/ /pubmed/37448834 http://dx.doi.org/10.3389/fdgth.2023.1203874 Text en © 2023 Tian, Zhang, Duan, McDonald and Osgood. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (https://creativecommons.org/licenses/by/4.0/). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Comparison of pretrained transformer-based models for influenza and COVID-19 detection using social media text data in Saskatchewan, Canada
topic Digital Health
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338115/
https://www.ncbi.nlm.nih.gov/pubmed/37448834
http://dx.doi.org/10.3389/fdgth.2023.1203874