Automatic gender detection in Twitter profiles for health-related cohort studies
OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users’ demographic information (eg, gender) is often not explicitly known from profiles. Here, we present...
Main authors: | Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Love, Jennifer S; Perrone, Jeanmarie; Sarker, Abeed |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Research and Applications |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220305/ https://www.ncbi.nlm.nih.gov/pubmed/34169232 http://dx.doi.org/10.1093/jamiaopen/ooab042 |
_version_ | 1783711120088891392 |
---|---|
author | Yang, Yuan-Chi Al-Garadi, Mohammed Ali Love, Jennifer S Perrone, Jeanmarie Sarker, Abeed |
author_facet | Yang, Yuan-Chi Al-Garadi, Mohammed Ali Love, Jennifer S Perrone, Jeanmarie Sarker, Abeed |
author_sort | Yang, Yuan-Chi |
collection | PubMed |
description | OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users’ demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and we illustrate how gender information can be incorporated into a social media-based health-related study. MATERIALS AND METHODS: We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users’ information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system’s utility. RESULTS AND DISCUSSION: We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0–94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0–96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends: the proportions of females closely resemble data from the National Survey on Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45) and data on overdose emergency room visits due to opioids from the Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37). CONCLUSION: Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public). |
format | Online Article Text |
id | pubmed-8220305 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8220305 2021-06-23 Automatic gender detection in Twitter profiles for health-related cohort studies Yang, Yuan-Chi Al-Garadi, Mohammed Ali Love, Jennifer S Perrone, Jeanmarie Sarker, Abeed JAMIA Open Research and Applications Oxford University Press 2021-06-23 /pmc/articles/PMC8220305/ /pubmed/34169232 http://dx.doi.org/10.1093/jamiaopen/ooab042 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Research and Applications Yang, Yuan-Chi Al-Garadi, Mohammed Ali Love, Jennifer S Perrone, Jeanmarie Sarker, Abeed Automatic gender detection in Twitter profiles for health-related cohort studies |
title | Automatic gender detection in Twitter profiles for health-related cohort studies |
title_full | Automatic gender detection in Twitter profiles for health-related cohort studies |
title_fullStr | Automatic gender detection in Twitter profiles for health-related cohort studies |
title_full_unstemmed | Automatic gender detection in Twitter profiles for health-related cohort studies |
title_short | Automatic gender detection in Twitter profiles for health-related cohort studies |
title_sort | automatic gender detection in twitter profiles for health-related cohort studies |
topic | Research and Applications |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220305/ https://www.ncbi.nlm.nih.gov/pubmed/34169232 http://dx.doi.org/10.1093/jamiaopen/ooab042 |
work_keys_str_mv | AT yangyuanchi automaticgenderdetectionintwitterprofilesforhealthrelatedcohortstudies AT algaradimohammedali automaticgenderdetectionintwitterprofilesforhealthrelatedcohortstudies AT lovejennifers automaticgenderdetectionintwitterprofilesforhealthrelatedcohortstudies AT perronejeanmarie automaticgenderdetectionintwitterprofilesforhealthrelatedcohortstudies AT sarkerabeed automaticgenderdetectionintwitterprofilesforhealthrelatedcohortstudies |
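
The abstract in this record mentions a meta-classifier that combines the predicted scores of base classifiers (an SVM and M3), but it does not say how the scores are combined. The sketch below is only an illustration of the general idea, not the authors' pipeline: it assumes each base classifier emits a score in [0, 1], mixes the two with a single weight, and picks that weight by grid search on labeled data. The function names, the weighted-average rule, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from typing import Sequence, Tuple


def combine(svm_score: float, m3_score: float, weight: float) -> float:
    """Weighted average of two base-classifier scores (assumed to lie in [0, 1])."""
    return weight * svm_score + (1.0 - weight) * m3_score


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of users whose thresholded score matches the 0/1 label."""
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def fit_weight(svm_scores: Sequence[float], m3_scores: Sequence[float],
               labels: Sequence[int], steps: int = 20) -> Tuple[float, float]:
    """Grid-search the mixing weight that maximizes accuracy on labeled data.

    Returns (weight, accuracy); ties keep the smallest weight tried.
    """
    best_w, best_a = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        mixed = [combine(a, b, w) for a, b in zip(svm_scores, m3_scores)]
        acc = accuracy(mixed, labels)
        if acc > best_a:
            best_w, best_a = w, acc
    return best_w, best_a


# Made-up scores for six users; these are NOT data from the study.
svm_scores = [0.9, 0.4, 0.2, 0.75, 0.3, 0.6]
m3_scores = [0.8, 0.7, 0.1, 0.2, 0.6, 0.3]
labels = [1, 1, 0, 1, 0, 0]

best_weight, best_acc = fit_weight(svm_scores, m3_scores, labels)
print("SVM alone:", accuracy(svm_scores, labels))
print("M3 alone:", accuracy(m3_scores, labels))
print("Mixed:", best_weight, best_acc)
```

On this toy data neither base score alone classifies every user correctly, while a mixed weight does, which is the motivation for an ensemble. In practice the weight would be fitted on a held-out split rather than the evaluation data, and a learned combiner (for example, logistic regression over the base scores) is a common alternative to a fixed weighted average.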