
Automatic gender detection in Twitter profiles for health-related cohort studies

OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media user’s demographic information (eg, gender) is often not explicitly known from profiles. Here, we present...


Bibliographic Details
Main Authors: Yang, Yuan-Chi; Al-Garadi, Mohammed Ali; Love, Jennifer S; Perrone, Jeanmarie; Sarker, Abeed
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Research and Applications
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220305/
https://www.ncbi.nlm.nih.gov/pubmed/34169232
http://dx.doi.org/10.1093/jamiaopen/ooab042
author Yang, Yuan-Chi
Al-Garadi, Mohammed Ali
Love, Jennifer S
Perrone, Jeanmarie
Sarker, Abeed
collection PubMed
description OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media user’s demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and we illustrate how gender information can be incorporated into a social media-based health-related study. MATERIALS AND METHODS: We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users’ information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system’s utility. RESULTS AND DISCUSSION: We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0–94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0–96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends—proportions of females closely resemble data from the National Survey of Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45), and the overdose Emergency Room Visit due to Opioids by Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37). 
CONCLUSION: Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public).
format Online
Article
Text
id pubmed-8220305
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8220305 2021-06-23 Automatic gender detection in Twitter profiles for health-related cohort studies Yang, Yuan-Chi Al-Garadi, Mohammed Ali Love, Jennifer S Perrone, Jeanmarie Sarker, Abeed JAMIA Open Research and Applications Oxford University Press 2021-06-23 /pmc/articles/PMC8220305/ /pubmed/34169232 http://dx.doi.org/10.1093/jamiaopen/ooab042 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Automatic gender detection in Twitter profiles for health-related cohort studies
topic Research and Applications
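
Illustrative sketch (not part of the catalog record): the abstract describes a meta-classifier that combines the predicted scores of base classifiers and reports accuracy with a 95% confidence interval. The abstract does not say how the scores are combined or how the interval was computed, so everything below is an assumption made for illustration: a logistic-regression stacker over two scores, a percentile bootstrap for the interval, and synthetic scores standing in for the SVM and M3 outputs. All function names are hypothetical. The authors' actual pipeline is at the Bitbucket URL given in the description.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_classifier(scores, labels, lr=1.0, epochs=500):
    """Fit logistic regression on base-classifier scores (simple stacking).

    ASSUMPTION: the abstract does not specify the meta-classifier; logistic
    regression trained by batch gradient descent is used here for illustration.
    scores: list of [score_a, score_b] rows, one per user.
    labels: list of 0/1 gold labels.
    """
    n_features = len(scores[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(scores, labels):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, row):
    # Threshold the combined score at 0.5.
    logit = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if sigmoid(logit) >= 0.5 else 0


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy (an assumed method)."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lo, hi


def make_synthetic_scores(n, seed):
    """SYNTHETIC data: two noisy scores per user, not the paper's datasets."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 0.7 if y == 1 else 0.3
        rows.append([
            min(1.0, max(0.0, rng.gauss(centre, 0.2))),
            min(1.0, max(0.0, rng.gauss(centre, 0.2))),
        ])
        labels.append(y)
    return rows, labels


train_x, train_y = make_synthetic_scores(400, seed=1)
test_x, test_y = make_synthetic_scores(200, seed=2)
weights, bias = fit_meta_classifier(train_x, train_y)
predictions = [predict(weights, bias, row) for row in test_x]
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(test_y, predictions)
print(f"accuracy={accuracy:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

In a real study the two columns would hold held-out scores from the base classifiers, so that the stacker is not trained on scores the base models produced for their own training users.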