DUKweb, diachronic word representations from the UK Web Archive corpus
Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have beco...
Main authors: | Tsakalidis, Adam; Basile, Pierpaolo; Bazzi, Marya; Cucuringu, Mihai; McGillivray, Barbara |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8520005/ https://www.ncbi.nlm.nih.gov/pubmed/34654827 http://dx.doi.org/10.1038/s41597-021-01047-x |
_version_ | 1784584574119968768 |
---|---|
author | Tsakalidis, Adam Basile, Pierpaolo Bazzi, Marya Cucuringu, Mihai McGillivray, Barbara |
author_facet | Tsakalidis, Adam Basile, Pierpaolo Bazzi, Marya Cucuringu, Mihai McGillivray, Barbara |
author_sort | Tsakalidis, Adam |
collection | PubMed |
description | Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996–2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in ‘.uk’. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain Dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection. |
format | Online Article Text |
id | pubmed-8520005 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8520005 2021-10-29 DUKweb, diachronic word representations from the UK Web Archive corpus Tsakalidis, Adam Basile, Pierpaolo Bazzi, Marya Cucuringu, Mihai McGillivray, Barbara Sci Data Data Descriptor Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996–2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in ‘.uk’. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain Dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection. Nature Publishing Group UK 2021-10-15 /pmc/articles/PMC8520005/ /pubmed/34654827 http://dx.doi.org/10.1038/s41597-021-01047-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the metadata files associated with this article. |
spellingShingle | Data Descriptor Tsakalidis, Adam Basile, Pierpaolo Bazzi, Marya Cucuringu, Mihai McGillivray, Barbara DUKweb, diachronic word representations from the UK Web Archive corpus |
title | DUKweb, diachronic word representations from the UK Web Archive corpus |
title_full | DUKweb, diachronic word representations from the UK Web Archive corpus |
title_fullStr | DUKweb, diachronic word representations from the UK Web Archive corpus |
title_full_unstemmed | DUKweb, diachronic word representations from the UK Web Archive corpus |
title_short | DUKweb, diachronic word representations from the UK Web Archive corpus |
title_sort | dukweb, diachronic word representations from the uk web archive corpus |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8520005/ https://www.ncbi.nlm.nih.gov/pubmed/34654827 http://dx.doi.org/10.1038/s41597-021-01047-x |
work_keys_str_mv | AT tsakalidisadam dukwebdiachronicwordrepresentationsfromtheukwebarchivecorpus AT basilepierpaolo dukwebdiachronicwordrepresentationsfromtheukwebarchivecorpus AT bazzimarya dukwebdiachronicwordrepresentationsfromtheukwebarchivecorpus AT cucuringumihai dukwebdiachronicwordrepresentationsfromtheukwebarchivecorpus AT mcgillivraybarbara dukwebdiachronicwordrepresentationsfromtheukwebarchivecorpus |