
An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter

Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adver...


Bibliographic Details
Main Authors: Klein, Ari Z., Gonzalez-Hernandez, Graciela
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects: Data Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7481818/
https://www.ncbi.nlm.nih.gov/pubmed/32944604
http://dx.doi.org/10.1016/j.dib.2020.106249
author Klein, Ari Z.
Gonzalez-Hernandez, Graciela
collection PubMed
description Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as “outcome” include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These “outcome” tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users’ broader timelines—tweets posted by a user over time—for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter. 
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in “A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes” [10].
format Online
Article
Text
id pubmed-7481818
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-7481818 2020-09-16 An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter. Klein, Ari Z.; Gonzalez-Hernandez, Graciela. Data Brief. Data Article. Elsevier 2020-08-31. /pmc/articles/PMC7481818/ /pubmed/32944604 http://dx.doi.org/10.1016/j.dib.2020.106249 Text en © 2020 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
topic Data Article