A longitudinal study of topic classification on Twitter
Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using sta...
Main Authors: | Bouadjenek, Mohamed Reda; Sanner, Scott; Iman, Zahra; Xie, Lexing; Shi, Daniel Xiaoliang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2022 |
Subjects: | Data Science |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202616/ https://www.ncbi.nlm.nih.gov/pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991 |
_version_ | 1784728567794368512 |
---|---|
author | Bouadjenek, Mohamed Reda Sanner, Scott Iman, Zahra Xie, Lexing Shi, Daniel Xiaoliang |
author_facet | Bouadjenek, Mohamed Reda Sanner, Scott Iman, Zahra Xie, Lexing Shi, Daniel Xiaoliang |
author_sort | Bouadjenek, Mohamed Reda |
collection | PubMed |
description | Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in 1 year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the “Iran nuclear deal”. The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. 
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance. |
format | Online Article Text |
id | pubmed-9202616 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-92026162022-06-17 A longitudinal study of topic classification on Twitter Bouadjenek, Mohamed Reda Sanner, Scott Iman, Zahra Xie, Lexing Shi, Daniel Xiaoliang PeerJ Comput Sci Data Science Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in 1 year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the “Iran nuclear deal”. The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. 
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance. PeerJ Inc. 2022-06-07 /pmc/articles/PMC9202616/ /pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991 Text en © 2022 Bouadjenek et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Data Science Bouadjenek, Mohamed Reda Sanner, Scott Iman, Zahra Xie, Lexing Shi, Daniel Xiaoliang A longitudinal study of topic classification on Twitter |
title | A longitudinal study of topic classification on Twitter |
title_full | A longitudinal study of topic classification on Twitter |
title_fullStr | A longitudinal study of topic classification on Twitter |
title_full_unstemmed | A longitudinal study of topic classification on Twitter |
title_short | A longitudinal study of topic classification on Twitter |
title_sort | longitudinal study of topic classification on twitter |
topic | Data Science |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202616/ https://www.ncbi.nlm.nih.gov/pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991 |