
A longitudinal study of topic classification on Twitter

Bibliographic Details
Main Authors:	Bouadjenek, Mohamed Reda; Sanner, Scott; Iman, Zahra; Xie, Lexing; Shi, Daniel Xiaoliang
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2022
Subjects:	Data Science
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202616/ https://www.ncbi.nlm.nih.gov/pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991

collection	PubMed
description	Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in one year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the “Iran nuclear deal”. The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training, though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count. In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.
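The description mentions a corpus spanning 2013 and 2014, training in one year and predicting in the next, and removing tweets that contain training hashtags from the validation set. The following is only a minimal sketch of that kind of data preparation, not the authors' pipeline. The `Tweet` record, the helper names, the hashtag regular expression, and the toy tweets are all hypothetical and chosen purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical hashtag pattern; real tweet tokenization is more involved.
HASHTAG_RE = re.compile(r"#(\w+)")


@dataclass(frozen=True)
class Tweet:
    """Illustrative record; the actual corpus schema is not given in the source."""
    text: str
    year: int


def hashtags(tweet):
    """Return the lower-cased hashtags that occur in a tweet."""
    return {tag.lower() for tag in HASHTAG_RE.findall(tweet.text)}


def temporal_split(tweets, train_year):
    """Split tweets into those up to train_year and those strictly after it."""
    train = [t for t in tweets if t.year <= train_year]
    later = [t for t in tweets if t.year > train_year]
    return train, later


def drop_seen_hashtags(later, training_hashtags):
    """Keep only later tweets that share no hashtag with the training set."""
    seen = {h.lower() for h in training_hashtags}
    return [t for t in later if not (hashtags(t) & seen)]


# Toy data, invented for this example.
tweets = [
    Tweet("Talks resume #IranDeal", 2013),
    Tweet("New round of #IranDeal negotiations", 2014),
    Tweet("Sanctions relief debated #IranTalks", 2014),
    Tweet("Great match tonight #football", 2014),
]

train, later = temporal_split(tweets, train_year=2013)
training_hashtags = set().union(*(hashtags(t) for t in train))
validation = drop_seen_hashtags(later, training_hashtags)
```

In this toy run the 2014 tweet that reuses the training hashtag is dropped, so the validation set only contains tweets whose hashtags were unseen during training.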
id	pubmed-9202616
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-9202616 2022-06-17 A longitudinal study of topic classification on Twitter. Bouadjenek, Mohamed Reda; Sanner, Scott; Iman, Zahra; Xie, Lexing; Shi, Daniel Xiaoliang. PeerJ Comput Sci. Data Science. PeerJ Inc. 2022-06-07 /pmc/articles/PMC9202616/ /pubmed/35721404 http://dx.doi.org/10.7717/peerj-cs.991 Text en © 2022 Bouadjenek et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.