Efficient and Reliable Geocoding of German Twitter Data to Enable Spatial Data Linkage to Official Statistics and Other Data Sources

More and more, social scientists are using (big) digital behavioral data for their research. In this context, the social network and microblogging platform Twitter is one of the most widely used data sources. In particular, geospatial analyses of Twitter data are proving to be fruitful for examining...

Bibliographic Details
Main Authors: Nguyen, H. Long; Tsolak, Dorian; Karmann, Anna; Knauff, Stefan; Kühne, Simon
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Sociology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9220088/
https://www.ncbi.nlm.nih.gov/pubmed/35755485
http://dx.doi.org/10.3389/fsoc.2022.910111
author Nguyen, H. Long
Tsolak, Dorian
Karmann, Anna
Knauff, Stefan
Kühne, Simon
collection PubMed
description More and more, social scientists are using (big) digital behavioral data for their research. In this context, the social network and microblogging platform Twitter is one of the most widely used data sources. In particular, geospatial analyses of Twitter data are proving to be fruitful for examining regional differences in user behavior and attitudes. However, ready-to-use spatial information in the form of GPS coordinates is only available for a tiny fraction of Twitter data, limiting research potential and making it difficult to link with data from other sources (e.g., official statistics and survey data) for regional analyses. We address this problem by using the free text locations provided by Twitter users in their profiles to determine the corresponding real-world locations. Since users can enter any text as a profile location, automated identification of geographic locations based on this information is highly complicated. With our method, we are able to assign over a quarter of the more than 866 million German tweets collected to real locations in Germany. This represents a vast improvement over the 0.18% of tweets in our corpus to which Twitter assigns geographic coordinates. Based on the geocoding results, we are not only able to determine a corresponding place for users with valid profile locations, but also the administrative level to which the place belongs. Enriching Twitter data with this information ensures that they can be directly linked to external data sources at different levels of aggregation. We show possible use cases for the fine-grained spatial data generated by our method and how it can be used to answer previously inaccessible research questions in the social sciences. We also provide a companion R package, nutscoder, to facilitate reuse of the geocoding method in this paper.
format Online
Article
Text
id pubmed-9220088
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9220088 2022-06-24 Efficient and Reliable Geocoding of German Twitter Data to Enable Spatial Data Linkage to Official Statistics and Other Data Sources Nguyen, H. Long Tsolak, Dorian Karmann, Anna Knauff, Stefan Kühne, Simon Front Sociol Sociology Frontiers Media S.A. 2022-06-09 /pmc/articles/PMC9220088/ /pubmed/35755485 http://dx.doi.org/10.3389/fsoc.2022.910111 Text en Copyright © 2022 Nguyen, Tsolak, Karmann, Knauff and Kühne. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Efficient and Reliable Geocoding of German Twitter Data to Enable Spatial Data Linkage to Official Statistics and Other Data Sources
topic Sociology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9220088/
https://www.ncbi.nlm.nih.gov/pubmed/35755485
http://dx.doi.org/10.3389/fsoc.2022.910111