
Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification

Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements on three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance.

Bibliographic Details
Main Authors: Guo, Yuting, Ge, Yao, Yang, Yuan-Chi, Al-Garadi, Mohammed Ali, Sarker, Abeed
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9408372/
https://www.ncbi.nlm.nih.gov/pubmed/36011135
http://dx.doi.org/10.3390/healthcare10081478
_version_ 1784774585082707968
author Guo, Yuting
Ge, Yao
Yang, Yuan-Chi
Al-Garadi, Mohammed Ali
Sarker, Abeed
author_facet Guo, Yuting
Ge, Yao
Yang, Yuan-Chi
Al-Garadi, Mohammed Ali
Sarker, Abeed
author_sort Guo, Yuting
collection PubMed
description Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements on three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance.
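
The description above outlines a two-stage approach: continued (domain-, source-, or topic-adaptive) pretraining of an off-the-shelf transformer encoder, followed by fine-tuning on a health-related social media classification task. The sketch below is purely illustrative and is not taken from the paper; it only approximates what such a pipeline can look like with the Hugging Face transformers and datasets libraries, assuming hypothetical input files (unlabeled_posts.txt, labeled_posts.csv), the roberta-base checkpoint, and placeholder hyperparameters.

# Minimal sketch only (not the authors' implementation): continued masked-language-model
# pretraining on in-source text, followed by fine-tuning for a binary classification task.
# File names, checkpoint, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "roberta-base"  # off-the-shelf checkpoint; a Twitter-pretrained model could be swapped in
tok = AutoTokenizer.from_pretrained(BASE)

def tokenize(batch):
    # Pad to a fixed length so the default collator can stack examples.
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

# Stage 1: SAPT/TSPT-style extended pretraining with masked language modeling.
# "unlabeled_posts.txt" is a hypothetical file with one unlabeled social media post per line.
mlm_data = load_dataset("text", data_files={"train": "unlabeled_posts.txt"})["train"]
mlm_data = mlm_data.map(tokenize, batched=True, remove_columns=["text"])
mlm_trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained(BASE),
    args=TrainingArguments(output_dir="adapted_ckpt", num_train_epochs=1),
    train_dataset=mlm_data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("adapted_ckpt")

# Stage 2: fine-tune the adapted encoder on a labeled classification task.
# "labeled_posts.csv" is a hypothetical file with "text" and integer "label" columns.
clf_data = load_dataset("csv", data_files={"train": "labeled_posts.csv"})["train"]
clf_data = clf_data.map(tokenize, batched=True)
clf_trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("adapted_ckpt", num_labels=2),
    args=TrainingArguments(output_dir="clf_ckpt", num_train_epochs=3),
    train_dataset=clf_data,
)
clf_trainer.train()

In the paper's terms, Stage 1 loosely corresponds to extended pretraining on source- or topic-specific text and Stage 2 to the supervised classification benchmark; the corpora, checkpoints, evaluation splits, and training settings actually used by the authors are not reproduced here.
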
format Online
Article
Text
id pubmed-9408372
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9408372 2022-08-26 Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification Guo, Yuting Ge, Yao Yang, Yuan-Chi Al-Garadi, Mohammed Ali Sarker, Abeed Healthcare (Basel) Article Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements on three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance. MDPI 2022-08-05 /pmc/articles/PMC9408372/ /pubmed/36011135 http://dx.doi.org/10.3390/healthcare10081478 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Guo, Yuting
Ge, Yao
Yang, Yuan-Chi
Al-Garadi, Mohammed Ali
Sarker, Abeed
Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title_full Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title_fullStr Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title_full_unstemmed Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title_short Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
title_sort comparison of pretraining models and strategies for health-related social media text classification
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9408372/
https://www.ncbi.nlm.nih.gov/pubmed/36011135
http://dx.doi.org/10.3390/healthcare10081478
work_keys_str_mv AT guoyuting comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification
AT geyao comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification
AT yangyuanchi comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification
AT algaradimohammedali comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification
AT sarkerabeed comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification