Clearing the Transcription Hurdle in Dialect Corpus Building: The Corpus of Southern Dutch Dialects as Case Study

Bibliographic Details
Main authors: Ghyselen, Anne-Sophie; Breitbarth, Anne; Farasyn, Melissa; Van Keymeulen, Jacques; van Hessen, Arjan
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Artificial Intelligence
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7861295/
https://www.ncbi.nlm.nih.gov/pubmed/33733130
http://dx.doi.org/10.3389/frai.2020.00010
author Ghyselen, Anne-Sophie
Breitbarth, Anne
Farasyn, Melissa
Van Keymeulen, Jacques
van Hessen, Arjan
collection PubMed
description This paper discusses how the transcription hurdle in dialect corpus building can be cleared. While corpus analysis has gained strongly in popularity in linguistic research, dialect corpora are still relatively scarce. This scarcity can be attributed to several factors, one of which is the challenging nature of transcribing dialects, given the lack of both orthographic norms for many dialects and speech technology tools trained on dialect data. This paper addresses two questions: (i) how dialects can be transcribed efficiently and (ii) whether speech technology tools can lighten the transcription work. These questions are tackled using the Southern Dutch dialects (SDDs) as a case study, for which the usefulness of automatic speech recognition (ASR), respeaking, and forced alignment is considered. Tests with these tools indicate that dialects still constitute a major challenge for speech technology. In the case of the SDDs, the decision was made to use speech technology only for the word-level segmentation of the audio files, as the transcription itself could not be sped up by ASR tools. The discussion does, however, indicate that the usefulness of ASR and related tools for a dialect corpus project is strongly determined by the sound quality of the dialect recordings, the availability of statistical dialect-specific models, the degree of linguistic differentiation between the dialects and the standard language, and the goals the transcripts have to serve.
format Online
Article
Text
id pubmed-7861295
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7861295 2021-03-16 Front Artif Intell Artificial Intelligence Frontiers Media S.A. 2020-04-15 /pmc/articles/PMC7861295/ /pubmed/33733130 http://dx.doi.org/10.3389/frai.2020.00010 Text en Copyright © 2020 Ghyselen, Breitbarth, Farasyn, Van Keymeulen and van Hessen.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Clearing the Transcription Hurdle in Dialect Corpus Building: The Corpus of Southern Dutch Dialects as Case Study
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7861295/
https://www.ncbi.nlm.nih.gov/pubmed/33733130
http://dx.doi.org/10.3389/frai.2020.00010