Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification
Main Authors: Guo, Yuting; Ge, Yao; Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Sarker, Abeed
Format: Online Article Text
Language: English
Published: MDPI, 2022
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9408372/ https://www.ncbi.nlm.nih.gov/pubmed/36011135 http://dx.doi.org/10.3390/healthcare10081478
_version_ | 1784774585082707968 |
author | Guo, Yuting Ge, Yao Yang, Yuan-Chi Al-Garadi, Mohammed Ali Sarker, Abeed |
author_facet | Guo, Yuting Ge, Yao Yang, Yuan-Chi Al-Garadi, Mohammed Ali Sarker, Abeed |
author_sort | Guo, Yuting |
collection | PubMed |
description | Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models; BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance. |
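The workflow summarized in the abstract (continued pretraining of an off-the-shelf checkpoint on in-source or topic-specific text, followed by fine-tuning on a labeled classification task) can be sketched with the Hugging Face transformers and datasets libraries. This is a minimal illustration under stated assumptions, not the authors' code: the roberta-base checkpoint, the file names unlabeled_tweets.txt and labeled_tweets.csv, and all hyperparameters are placeholders chosen for the example.

```python
# Sketch of the two-stage approach described in the abstract:
# (1) continued masked-language-model pretraining of an off-the-shelf RoBERTa
#     checkpoint on unlabeled in-source text (SAPT/TSPT-style), then
# (2) fine-tuning the adapted encoder on a labeled health-related
#     classification task.
# File names, label scheme, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# ---- Stage 1: source-/topic-adaptive pretraining via masked language modeling ----
# "unlabeled_tweets.txt" is a hypothetical file with one post per line.
unlabeled = load_dataset("text", data_files={"train": "unlabeled_tweets.txt"})["train"]
unlabeled = unlabeled.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

mlm_model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="roberta-adapted", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=unlabeled,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("roberta-adapted")

# ---- Stage 2: fine-tune the adapted checkpoint for binary classification ----
# "labeled_tweets.csv" is assumed to have "text" and integer "label" columns.
labeled = load_dataset("csv", data_files={"train": "labeled_tweets.csv"})["train"]
labeled = labeled.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

clf_model = AutoModelForSequenceClassification.from_pretrained("roberta-adapted", num_labels=2)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="roberta-clf", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=labeled,
)
clf_trainer.train()
```

The same two-stage recipe applies to the other checkpoints compared in the study (e.g., BERTweet) by swapping MODEL_NAME; only the choice of unlabeled corpus distinguishes domain-adaptive, source-adaptive, and topic-specific pretraining.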
format | Online Article Text |
id | pubmed-9408372 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9408372 2022-08-26 Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification Guo, Yuting Ge, Yao Yang, Yuan-Chi Al-Garadi, Mohammed Ali Sarker, Abeed Healthcare (Basel) Article Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models; BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance. MDPI 2022-08-05 /pmc/articles/PMC9408372/ /pubmed/36011135 http://dx.doi.org/10.3390/healthcare10081478 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Guo, Yuting Ge, Yao Yang, Yuan-Chi Al-Garadi, Mohammed Ali Sarker, Abeed Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title | Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title_full | Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title_fullStr | Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title_full_unstemmed | Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title_short | Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification |
title_sort | comparison of pretraining models and strategies for health-related social media text classification |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9408372/ https://www.ncbi.nlm.nih.gov/pubmed/36011135 http://dx.doi.org/10.3390/healthcare10081478 |
work_keys_str_mv | AT guoyuting comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification AT geyao comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification AT yangyuanchi comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification AT algaradimohammedali comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification AT sarkerabeed comparisonofpretrainingmodelsandstrategiesforhealthrelatedsocialmediatextclassification |