
DUKweb, diachronic word representations from the UK Web Archive corpus

Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have beco...

Bibliographic Details
Main authors: Tsakalidis, Adam, Basile, Pierpaolo, Bazzi, Marya, Cucuringu, Mihai, McGillivray, Barbara
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects: Data Descriptor
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8520005/
https://www.ncbi.nlm.nih.gov/pubmed/34654827
http://dx.doi.org/10.1038/s41597-021-01047-x
author Tsakalidis, Adam
Basile, Pierpaolo
Bazzi, Marya
Cucuringu, Mihai
McGillivray, Barbara
collection PubMed
description Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996–2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in ‘.uk’. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection.
format Online
Article
Text
id pubmed-8520005
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8520005 2021-10-29 DUKweb, diachronic word representations from the UK Web Archive corpus Tsakalidis, Adam Basile, Pierpaolo Bazzi, Marya Cucuringu, Mihai McGillivray, Barbara Sci Data Data Descriptor Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996–2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in ‘.uk’. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection. Nature Publishing Group UK 2021-10-15 /pmc/articles/PMC8520005/ /pubmed/34654827 http://dx.doi.org/10.1038/s41597-021-01047-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the metadata files associated with this article.
title DUKweb, diachronic word representations from the UK Web Archive corpus
topic Data Descriptor