A learner corpus is born this way: From raw data to processed dataset

This data article presents the development of a learner corpus (i.e. a systematic computerized web-based repository of written texts produced by language learners) from the initial phase of the development where written assignments were collected from language learners as raw data to the critical ph...

Bibliographic Details
Main Authors: Leung, Chung Hong Danny, Chow, Mei Yung Vanliza, Ge, Haoyan
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9421320/
https://www.ncbi.nlm.nih.gov/pubmed/36045644
http://dx.doi.org/10.1016/j.dib.2022.108527
_version_ 1784777565749116928
author Leung, Chung Hong Danny
Chow, Mei Yung Vanliza
Ge, Haoyan
author_facet Leung, Chung Hong Danny
Chow, Mei Yung Vanliza
Ge, Haoyan
author_sort Leung, Chung Hong Danny
collection PubMed
description This data article presents the development of a learner corpus (i.e. a systematic computerized web-based repository of written texts produced by language learners) from the initial phase of the development where written assignments were collected from language learners as raw data to the critical phases where the processed text data and meta data were aligned and transformed to the web interface of the corpus. The corpus developed is called the CELL (Chinese and English Learner Language) Corpus, which comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) meta data including the demographic information of the participants whose text data were collected. This article first outlines the steps for collecting the text data and meta data and then explains the processes for cleaning, annotating and tagging the text data. Discussion of the problems the research team encountered with segmentation of the Chinese text data and accuracy check of the processed datasets is also included in this article. The CELL Corpus comes with the concordance and word list features which will enable language teachers and researchers to investigate frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported in this article will inform future development of learner language corpora of different languages.
format Online
Article
Text
id pubmed-9421320
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9421320 2022-08-30 A learner corpus is born this way: From raw data to processed dataset Leung, Chung Hong Danny Chow, Mei Yung Vanliza Ge, Haoyan Data Brief Data Article This data article presents the development of a learner corpus (i.e. a systematic computerized web-based repository of written texts produced by language learners) from the initial phase of the development where written assignments were collected from language learners as raw data to the critical phases where the processed text data and meta data were aligned and transformed to the web interface of the corpus. The corpus developed is called the CELL (Chinese and English Learner Language) Corpus, which comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) meta data including the demographic information of the participants whose text data were collected. This article first outlines the steps for collecting the text data and meta data and then explains the processes for cleaning, annotating and tagging the text data. Discussion of the problems the research team encountered with segmentation of the Chinese text data and accuracy check of the processed datasets is also included in this article. The CELL Corpus comes with the concordance and word list features which will enable language teachers and researchers to investigate frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported in this article will inform future development of learner language corpora of different languages. Elsevier 2022-08-08 /pmc/articles/PMC9421320/ /pubmed/36045644 http://dx.doi.org/10.1016/j.dib.2022.108527 Text en © 2022 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title A learner corpus is born this way: From raw data to processed dataset
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9421320/
https://www.ncbi.nlm.nih.gov/pubmed/36045644
http://dx.doi.org/10.1016/j.dib.2022.108527