Cargando…
Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis
BACKGROUND: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematical...
Autores principales: | , , , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
JMIR Publications
2017
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5684515/ https://www.ncbi.nlm.nih.gov/pubmed/29084707 http://dx.doi.org/10.2196/jmir.8164 |
_version_ | 1783278494835277824 |
---|---|
author | Sarker, Abeed Chandrashekar, Pramod Magge, Arjun Cai, Haitao Klein, Ari Gonzalez, Graciela |
author_facet | Sarker, Abeed Chandrashekar, Pramod Magge, Arjun Cai, Haitao Klein, Ari Gonzalez, Graciela |
author_sort | Sarker, Abeed |
collection | PubMed |
description | BACKGROUND: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the model of pregnancy registries has distinct advantages over other study designs, they are faced with numerous challenges and limitations such as low enrollment rate, high cost, and selection bias. OBJECTIVE: The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. In addition, we also attempted to ascertain, in a preliminary fashion, what types of longitudinal information may potentially be mined from the collected cohort information. METHODS: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined. RESULTS: Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. Manual annotation agreement for three annotators was very high at kappa (κ)=.79. On a blind test set, the implemented classifier obtained an overall F(1) score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them. CONCLUSIONS: Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them. |
format | Online Article Text |
id | pubmed-5684515 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-56845152017-11-20 Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis Sarker, Abeed Chandrashekar, Pramod Magge, Arjun Cai, Haitao Klein, Ari Gonzalez, Graciela J Med Internet Res Original Paper BACKGROUND: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the model of pregnancy registries has distinct advantages over other study designs, they are faced with numerous challenges and limitations such as low enrollment rate, high cost, and selection bias. OBJECTIVE: The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. In addition, we also attempted to ascertain, in a preliminary fashion, what types of longitudinal information may potentially be mined from the collected cohort information. METHODS: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined. RESULTS: Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. Manual annotation agreement for three annotators was very high at kappa (κ)=.79. On a blind test set, the implemented classifier obtained an overall F(1) score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them. CONCLUSIONS: Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them. JMIR Publications 2017-10-30 /pmc/articles/PMC5684515/ /pubmed/29084707 http://dx.doi.org/10.2196/jmir.8164 Text en ©Abeed Sarker, Pramod Chandrashekar, Arjun Magge, Haitao Cai, Ari Klein, Graciela Gonzalez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 30.10.2017. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Sarker, Abeed Chandrashekar, Pramod Magge, Arjun Cai, Haitao Klein, Ari Gonzalez, Graciela Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title | Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title_full | Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title_fullStr | Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title_full_unstemmed | Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title_short | Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis |
title_sort | discovering cohorts of pregnant women from social media for safety surveillance and analysis |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5684515/ https://www.ncbi.nlm.nih.gov/pubmed/29084707 http://dx.doi.org/10.2196/jmir.8164 |
work_keys_str_mv | AT sarkerabeed discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis AT chandrashekarpramod discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis AT maggearjun discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis AT caihaitao discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis AT kleinari discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis AT gonzalezgraciela discoveringcohortsofpregnantwomenfromsocialmediaforsafetysurveillanceandanalysis |