Predicting Age Groups of Reddit Users Based on Posting Behavior and Metadata: Classification Model Development and Validation

BACKGROUND: Social media are important for monitoring perceptions of public health issues and for educating target audiences about health; however, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how we...

Bibliographic Details
Main Authors: Chew, Robert; Kery, Caroline; Baum, Laura; Bukowski, Thomas; Kim, Annice; Navarro, Mario
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087286/
https://www.ncbi.nlm.nih.gov/pubmed/33724195
http://dx.doi.org/10.2196/25807
author Chew, Robert
Kery, Caroline
Baum, Laura
Bukowski, Thomas
Kim, Annice
Navarro, Mario
collection PubMed
description BACKGROUND: Social media are important for monitoring perceptions of public health issues and for educating target audiences about health; however, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how well social media can be used for public health surveillance and education outreach efforts. Certain social media platforms provide demographic information on the followers of a user account, but this information is not always disclosed, and researchers have developed machine learning algorithms to predict social media users’ demographic characteristics, mainly for Twitter. To date, there has been limited research on predicting the demographic characteristics of Reddit users.

OBJECTIVE: We aimed to develop a machine learning algorithm that predicts the age segment of Reddit users, as either adolescents or adults, based on publicly available data.

METHODS: This study was conducted between January and September 2020 using publicly available Reddit posts as input data. We manually labeled Reddit users’ age by identifying and reviewing public posts in which Reddit users self-reported their age. We then collected sample posts, comments, and metadata for the labeled user accounts and created variables to capture linguistic patterns, posting behavior, and account details that would distinguish the adolescent age group (aged 13 to 20 years) from the adult age group (aged 21 to 54 years). We split the data into training (n=1660) and test (n=415) sets and performed 5-fold cross-validation on the training set to select hyperparameters and perform feature selection. We ran multiple classification algorithms and tested the performance of the models (precision, recall, F1 score) in predicting the age segments of the users in the labeled data. To evaluate associations between each feature and the outcome, we calculated means and confidence intervals for each transformed model feature and compared the two age groups with 2-sample t tests.

RESULTS: The gradient boosted trees classifier performed the best, with an F1 score of 0.78. The test set precision and recall scores were 0.79 and 0.89, respectively, for the adolescent group (n=254) and 0.78 and 0.63, respectively, for the adult group (n=161). The most important feature in the model was the number of sentences per comment (permutation score: mean 0.100, SD 0.004). Members of the adolescent age group tended to have created accounts more recently, have higher proportions of submissions and comments in the r/teenagers subreddit, and post more in subreddits with higher subscriber counts than those in the adult group.

CONCLUSIONS: We created a Reddit age prediction algorithm with competitive accuracy using publicly available data, suggesting machine learning methods can help public health agencies identify age-related target audiences on Reddit. Our results also suggest that there are characteristics of Reddit users’ posting behavior, linguistic patterns, and account features that distinguish adolescents from adults.
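
The test-set figures quoted in the description can be cross-checked with a little arithmetic. The sketch below is not the authors' code; it only recomputes per-class F1 scores from the reported precision and recall values and averages them two ways. The abstract does not say how the overall F1 of 0.78 was averaged, so the macro and support-weighted averages here are assumptions, and because the inputs are already rounded to two decimals the results are approximate. All variable and function names are illustrative.

```python
# Figures as reported in the abstract (test set, n=415).
reported = {
    "adolescent": {"precision": 0.79, "recall": 0.89, "n": 254},
    "adult": {"precision": 0.78, "recall": 0.63, "n": 161},
}


def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class F1 derived from the rounded precision/recall values.
per_class_f1 = {g: f1(v["precision"], v["recall"]) for g, v in reported.items()}

# Two common ways to combine per-class scores (which one the paper
# used is an assumption, not something the abstract states).
total = sum(v["n"] for v in reported.values())
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[g] * reported[g]["n"] for g in reported) / total

# Share of labeled users held out for testing (1660 train, 415 test).
train_n, test_n = 1660, 415
test_share = test_n / (train_n + test_n)

for group, score in per_class_f1.items():
    print(f"{group}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
print(f"test share = {test_share:.0%}")
```

Under these assumptions the support-weighted average rounds to 0.78, which is consistent with the reported overall score, while the macro average rounds to 0.77. The group sizes also add up to the test set size (254 + 161 = 415), and the split corresponds to holding out 20% of the 2075 labeled users.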
format Online
Article
Text
id pubmed-8087286
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-8087286 2021-05-07 JMIR Public Health Surveill Original Paper JMIR Publications 2021-03-16 /pmc/articles/PMC8087286/ /pubmed/33724195 http://dx.doi.org/10.2196/25807 Text en
©Robert Chew, Caroline Kery, Laura Baum, Thomas Bukowski, Annice Kim, Mario Navarro. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 16.03.2021.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
title Predicting Age Groups of Reddit Users Based on Posting Behavior and Metadata: Classification Model Development and Validation
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087286/
https://www.ncbi.nlm.nih.gov/pubmed/33724195
http://dx.doi.org/10.2196/25807