
An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian

Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to be...


Bibliographic Details
Main authors: Pota, Marco; Ventura, Mirko; Catelli, Rosario; Esposito, Massimo
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7796054/
https://www.ncbi.nlm.nih.gov/pubmed/33379231
http://dx.doi.org/10.3390/s21010133
author Pota, Marco
Ventura, Mirko
Catelli, Rosario
Esposito, Massimo
collection PubMed
description Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle Twitter jargon. This work introduces a different approach to Twitter sentiment analysis, based on two steps. First, the tweet jargon, including emojis and emoticons, is transformed into plain text using procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified using the language model BERT, pre-trained on plain text instead of tweets, for two reasons: (1) models pre-trained on plain text are readily available in many languages, which avoids resource- and time-consuming training on tweets from scratch; (2) available plain-text corpora are larger than tweet-only ones, which allows better performance. A case study applying the approach to Italian is presented, with a comparison against other existing Italian solutions. The results show the effectiveness of the approach and indicate that, because its methodological basis is general, it is also promising for other languages.
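The first step described above (turning tweet jargon into plain text) can be illustrated with a minimal sketch. This is not the authors' implementation: the emoticon lexicon, the placeholder words `url` and `user`, and the function names are all hypothetical, and the standard-library Unicode names stand in for whatever emoji descriptions a real pipeline would use (they are in English, so a pipeline for Italian would need translated descriptions). The second step, classification with a BERT model pre-trained on plain text, is left out here.

```python
import re
import unicodedata

# Hypothetical emoticon lexicon (an assumption for illustration only);
# a real one would be larger and mapped to words in the target language.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad", ";)": "wink"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def emoji_to_text(ch):
    """Replace a symbol character with its lower-cased Unicode name.

    Rough heuristic: category "So" (Symbol, other) covers most emojis,
    but also signs such as the copyright symbol.
    """
    if unicodedata.category(ch) == "So":
        name = unicodedata.name(ch, "")
        return f" {name.lower()} " if name else " "
    return ch


def normalize_tweet(text):
    """Turn tweet-specific tokens into plain words (illustrative only)."""
    text = URL_RE.sub("url", text)        # links -> placeholder word
    text = MENTION_RE.sub("user", text)   # @mentions -> placeholder word
    text = HASHTAG_RE.sub(r"\1", text)    # "#Roma" -> "Roma"
    # Emoticons are matched as whole whitespace-separated tokens.
    text = " ".join(EMOTICONS.get(tok, tok) for tok in text.split())
    text = "".join(emoji_to_text(ch) for ch in text)
    return " ".join(text.split())         # collapse repeated whitespace


if __name__ == "__main__":
    print(normalize_tweet("@mario che bello #Roma :) \U0001F600 https://t.co/abc"))
```

The output of `normalize_tweet` is ordinary text, which is the point of the approach: it can be passed to the tokenizer of any model pre-trained on plain text, with no tweet-specific vocabulary required.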
format Online
Article
Text
id pubmed-7796054
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7796054 2021-01-10 Sensors (Basel) Article MDPI 2020-12-28 /pmc/articles/PMC7796054/ /pubmed/33379231 http://dx.doi.org/10.3390/s21010133 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian
topic Article