Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a...
Main Authors: | Gonçalves, Bruno; Sánchez, David |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237322/ https://www.ncbi.nlm.nih.gov/pubmed/25409174 http://dx.doi.org/10.1371/journal.pone.0112074 |
_version_ | 1782345322681335808 |
---|---|
author | Gonçalves, Bruno Sánchez, David |
author_facet | Gonçalves, Bruno Sánchez, David |
author_sort | Gonçalves, Bruno |
collection | PubMed |
description | We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. |
format | Online Article Text |
id | pubmed-4237322 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-4237322 2014-11-21 Crowdsourcing Dialect Characterization through Twitter Gonçalves, Bruno Sánchez, David PLoS One Research Article We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Public Library of Science 2014-11-19 /pmc/articles/PMC4237322/ /pubmed/25409174 http://dx.doi.org/10.1371/journal.pone.0112074 Text en © 2014 Gonçalves, Sánchez http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Gonçalves, Bruno Sánchez, David Crowdsourcing Dialect Characterization through Twitter |
title | Crowdsourcing Dialect Characterization through Twitter |
title_full | Crowdsourcing Dialect Characterization through Twitter |
title_fullStr | Crowdsourcing Dialect Characterization through Twitter |
title_full_unstemmed | Crowdsourcing Dialect Characterization through Twitter |
title_short | Crowdsourcing Dialect Characterization through Twitter |
title_sort | crowdsourcing dialect characterization through twitter |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4237322/ https://www.ncbi.nlm.nih.gov/pubmed/25409174 http://dx.doi.org/10.1371/journal.pone.0112074 |
work_keys_str_mv | AT goncalvesbruno crowdsourcingdialectcharacterizationthroughtwitter AT sanchezdavid crowdsourcingdialectcharacterizationthroughtwitter |