Crowdsourcing Dialect Characterization through Twitter

We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a...

Bibliographic Details
Main authors: Gonçalves, Bruno, Sánchez, David
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237322/
https://www.ncbi.nlm.nih.gov/pubmed/25409174
http://dx.doi.org/10.1371/journal.pone.0112074
_version_ 1782345322681335808
author Gonçalves, Bruno
Sánchez, David
author_facet Gonçalves, Bruno
Sánchez, David
author_sort Gonçalves, Bruno
collection PubMed
description We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.
format Online
Article
Text
id pubmed-4237322
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4237322 2014-11-21 Crowdsourcing Dialect Characterization through Twitter Gonçalves, Bruno Sánchez, David PLoS One Research Article We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Public Library of Science 2014-11-19 /pmc/articles/PMC4237322/ /pubmed/25409174 http://dx.doi.org/10.1371/journal.pone.0112074 Text en © 2014 Gonçalves, Sánchez http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Gonçalves, Bruno
Sánchez, David
Crowdsourcing Dialect Characterization through Twitter
title Crowdsourcing Dialect Characterization through Twitter
title_full Crowdsourcing Dialect Characterization through Twitter
title_fullStr Crowdsourcing Dialect Characterization through Twitter
title_full_unstemmed Crowdsourcing Dialect Characterization through Twitter
title_short Crowdsourcing Dialect Characterization through Twitter
title_sort crowdsourcing dialect characterization through twitter
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237322/
https://www.ncbi.nlm.nih.gov/pubmed/25409174
http://dx.doi.org/10.1371/journal.pone.0112074
work_keys_str_mv AT goncalvesbruno crowdsourcingdialectcharacterizationthroughtwitter
AT sanchezdavid crowdsourcingdialectcharacterizationthroughtwitter