Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents

The Internet offers great possibilities for many scientific disciplines that utilize text data. However, the potential of online data can be limited by the lack of information on the genre or register of the documents, as register—whether a text is, e.g., a news article or a recipe—is arguably the m...

Bibliographic Details
Main authors: Laippala, Veronika, Egbert, Jesse, Biber, Douglas, Kyröläinen, Aki-Juhani
Format: Online Article Text
Language: English
Published: Springer Netherlands 2021
Subjects: Original Paper
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8550160/
https://www.ncbi.nlm.nih.gov/pubmed/34720782
http://dx.doi.org/10.1007/s10579-020-09519-z
author Laippala, Veronika
Egbert, Jesse
Biber, Douglas
Kyröläinen, Aki-Juhani
collection PubMed
description The Internet offers great possibilities for many scientific disciplines that utilize text data. However, the potential of online data can be limited by the lack of information on the genre or register of the documents, as register—whether a text is, e.g., a news article or a recipe—is arguably the most important predictor of linguistic variation (see Biber in Corpus Linguist Linguist Theory 8:9–37, 2012). Despite having received significant attention in recent years, the modeling of online registers has faced a number of challenges, and previous studies have presented contradictory results. In particular, these have concerned (1) the extent to which registers can be automatically identified in a large, unrestricted corpus of web documents and (2) the stability of the models, specifically the kinds of linguistic features that achieve the best performance while reflecting the registers instead of corpus idiosyncrasies. Furthermore, although the linguistic properties of registers vary importantly in a number of ways that may affect their modeling, this variation is often bypassed. In this article, we tackle these issues. We model online registers in the largest available corpus of online registers, the Corpus of Online Registers of English (CORE). Additionally, we evaluate the stability of the models towards corpus idiosyncrasies, analyze the role of different linguistic features in them, and examine how individual registers differ in these two aspects. We show that (1) competitive classification performance on a large-scale, unrestricted corpus can be achieved through a combination of lexico-grammatical features, (2) the inclusion of grammatical information improves the stability of the model, whereas many of the previously best-performing feature sets are less stable, and that (3) registers can be placed in a continuum based on the discriminative importance of lexis and grammar. 
These register-specific characteristics can explain the variation observed in previous studies concerning the automatic identification of online registers and the importance of different linguistic features for them. Thus, our results offer explanations for the jungle-likeness of online data and provide essential information on online registers for all studies using online data.
format Online
Article
Text
id pubmed-8550160
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-8550160 2021-10-29 Lang Resour Eval Original Paper Springer Netherlands 2021-01-25 2021 /pmc/articles/PMC8550160/ /pubmed/34720782 http://dx.doi.org/10.1007/s10579-020-09519-z Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents
topic Original Paper