Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods

Researchers and policy makers worldwide are interested in measuring the subjective well-being of populations. When users post on social media, they leave behind digital traces that reflect their thoughts and feelings. Aggregation of such digital traces may make it possible to monitor well-being at l...

Bibliographic Details
Main authors: Jaidka, Kokil; Giorgi, Salvatore; Schwartz, H. Andrew; Kern, Margaret L.; Ungar, Lyle H.; Eichstaedt, Johannes C.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2020
Subjects: Physical Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7229753/
https://www.ncbi.nlm.nih.gov/pubmed/32341156
http://dx.doi.org/10.1073/pnas.1906364117
author Jaidka, Kokil
Giorgi, Salvatore
Schwartz, H. Andrew
Kern, Margaret L.
Ungar, Lyle H.
Eichstaedt, Johannes C.
author_sort Jaidka, Kokil
collection PubMed
description Researchers and policy makers worldwide are interested in measuring the subjective well-being of populations. When users post on social media, they leave behind digital traces that reflect their thoughts and feelings. Aggregation of such digital traces may make it possible to monitor well-being at large scale. However, social media-based methods need to be robust to regional effects if they are to produce reliable estimates. Using a sample of 1.53 billion geotagged English tweets, we provide a systematic evaluation of word-level and data-driven methods for text analysis for generating well-being estimates for 1,208 US counties. We compared Twitter-based county-level estimates with well-being measurements provided by the Gallup-Sharecare Well-Being Index survey through 1.73 million phone surveys. We find that word-level methods (e.g., Linguistic Inquiry and Word Count [LIWC] 2015 and Language Assessment by Mechanical Turk [LabMT]) yielded inconsistent county-level well-being measurements due to regional, cultural, and socioeconomic differences in language use. However, removing as few as three of the most frequent words led to notable improvements in well-being prediction. Data-driven methods provided robust estimates, approximating the Gallup data at up to r = 0.64. We show that the findings generalized to county socioeconomic and health outcomes and were robust when poststratifying the samples to be more representative of the general US population. Regional well-being estimation from social media data seems to be robust when supervised data-driven methods are used.
format Online
Article
Text
id pubmed-7229753
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-7229753 2020-05-26 Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods Jaidka, Kokil Giorgi, Salvatore Schwartz, H. Andrew Kern, Margaret L. Ungar, Lyle H. Eichstaedt, Johannes C. Proc Natl Acad Sci U S A Physical Sciences
National Academy of Sciences 2020-05-12 2020-04-27 /pmc/articles/PMC7229753/ /pubmed/32341156 http://dx.doi.org/10.1073/pnas.1906364117 Text en Copyright © 2020 the Author(s). Published by PNAS. http://creativecommons.org/licenses/by/4.0/ https://creativecommons.org/licenses/by/4.0/ This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY) (http://creativecommons.org/licenses/by/4.0/).
title Estimating geographic subjective well-being from Twitter: A comparison of dictionary and data-driven language methods
topic Physical Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7229753/
https://www.ncbi.nlm.nih.gov/pubmed/32341156
http://dx.doi.org/10.1073/pnas.1906364117
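
The word-level approach summarized in the description can be illustrated with a minimal sketch. This is not the authors' pipeline. The valence values, county names, survey numbers, and all function names below are hypothetical, and a plain Pearson correlation stands in for whatever evaluation the paper uses. The sketch only shows the general idea: score each region by the frequency-weighted mean valence of its dictionary words, optionally drop the most frequent dictionary words, and compare the regional scores with survey values.

```python
from collections import Counter
from math import sqrt


def dictionary_score(tokens, valence, excluded=frozenset()):
    """Frequency-weighted mean valence of the dictionary words in `tokens`."""
    counts = Counter(t for t in tokens if t in valence and t not in excluded)
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words left for this region
    return sum(valence[w] * n for w, n in counts.items()) / total


def most_frequent_dictionary_words(regions, valence, k):
    """The k dictionary words that occur most often across all regions."""
    counts = Counter(
        t for tokens in regions.values() for t in tokens if t in valence
    )
    return frozenset(w for w, _ in counts.most_common(k))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: a tiny valence dictionary, three made-up "counties",
# and made-up survey well-being values. None of this comes from the paper.
valence = {"love": 8.0, "happy": 8.5, "lol": 6.5, "sad": 2.0, "tired": 3.0}
regions = {
    "county_a": "lol lol lol happy love".split(),
    "county_b": "lol lol sad tired tired".split(),
    "county_c": "lol happy happy love sad".split(),
}
survey = {"county_a": 7.1, "county_b": 5.9, "county_c": 6.8}

# Drop the single most frequent dictionary word before scoring.
excluded = most_frequent_dictionary_words(regions, valence, k=1)

names = sorted(regions)
raw = [dictionary_score(regions[c], valence) for c in names]
filtered = [dictionary_score(regions[c], valence, excluded) for c in names]
target = [survey[c] for c in names]

print("excluded words:", sorted(excluded))
print("r (all words):", pearson_r(raw, target))
print("r (after exclusion):", pearson_r(filtered, target))
```

Because the data are invented, the printed correlations say nothing about real counties. The point is only that excluding a word changes every region's score in proportion to how often that region uses it, which is one way regional differences in word use can shift word-level estimates.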