A longitudinal study of topic classification on Twitter

Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using sta...

Bibliographic Details
Main authors: Bouadjenek, Mohamed Reda, Sanner, Scott, Iman, Zahra, Xie, Lexing, Shi, Daniel Xiaoliang
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202616/
https://www.ncbi.nlm.nih.gov/pubmed/35721404
http://dx.doi.org/10.7717/peerj-cs.991
author Bouadjenek, Mohamed Reda
Sanner, Scott
Iman, Zahra
Xie, Lexing
Shi, Daniel Xiaoliang
collection PubMed
description Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in 1 year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the “Iran nuclear deal”. The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. 
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.
format Online
Article
Text
id pubmed-9202616
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9202616 2022-06-17 A longitudinal study of topic classification on Twitter Bouadjenek, Mohamed Reda Sanner, Scott Iman, Zahra Xie, Lexing Shi, Daniel Xiaoliang PeerJ Comput Sci Data Science PeerJ Inc. 2022-06-07 /pmc/articles/PMC9202616/ /pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991 Text en © 2022 Bouadjenek et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title A longitudinal study of topic classification on Twitter
topic Data Science