Cargando…
Automatic gender detection in Twitter profiles for health-related cohort studies
OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media user’s demographic information (eg, gender) is often not explicitly known from profiles. Here, we present...
Autores principales: | , , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2021
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220305/ https://www.ncbi.nlm.nih.gov/pubmed/34169232 http://dx.doi.org/10.1093/jamiaopen/ooab042 |
Sumario: | OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media user’s demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and we illustrate how gender information can be incorporated into a social media-based health-related study. MATERIALS AND METHODS: We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users’ information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system’s utility. RESULTS AND DISCUSSION: We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0–94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0–96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends—proportions of females closely resemble data from the National Survey of Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45), and the overdose Emergency Room Visit due to Opioids by Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37). CONCLUSION: Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public). |
---|