NCHLT Auxiliary speech data for ASR technology development in South Africa

The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT pro...

Bibliographic Details
Main Authors: Badenhorst, Jaco; de Wet, Febe
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8814303/
https://www.ncbi.nlm.nih.gov/pubmed/35141372
http://dx.doi.org/10.1016/j.dib.2022.107860
_version_ 1784645024400539648
author Badenhorst, Jaco
de Wet, Febe
author_facet Badenhorst, Jaco
de Wet, Febe
author_sort Badenhorst, Jaco
collection PubMed
description The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development, South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology.
format Online
Article
Text
id pubmed-8814303
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-88143032022-02-08 NCHLT Auxiliary speech data for ASR technology development in South Africa Badenhorst, Jaco de Wet, Febe Data Brief Data Article The aim of the National Centre for Human Language Technology (NCHLT) project was to create speech and text resources that would enable Human Language Technology (HLT) development for the 11 official languages of South Africa. The speech data described in this paper was collected during the NCHLT project using a smartphone application. The official NCHLT Speech Corpus was released in 2014, but it did not include all recordings that were made during the data collection campaign. This paper describes the additional data that was recently released as auxiliary corpora [2]. The auxiliary data sets contain between 20 and 170 hours of speech data per language as well as the transcriptions associated with each utterance. In terms of the resources required for HLT development South Africa’s official languages are all under-resourced. The data described in this paper contributes toward alleviating this situation, specifically for the development of speech technology. Elsevier 2022-01-21 /pmc/articles/PMC8814303/ /pubmed/35141372 http://dx.doi.org/10.1016/j.dib.2022.107860 Text en © 2022 The Authors. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Data Article
Badenhorst, Jaco
de Wet, Febe
NCHLT Auxiliary speech data for ASR technology development in South Africa
title NCHLT Auxiliary speech data for ASR technology development in South Africa
title_full NCHLT Auxiliary speech data for ASR technology development in South Africa
title_fullStr NCHLT Auxiliary speech data for ASR technology development in South Africa
title_full_unstemmed NCHLT Auxiliary speech data for ASR technology development in South Africa
title_short NCHLT Auxiliary speech data for ASR technology development in South Africa
title_sort nchlt auxiliary speech data for asr technology development in south africa
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8814303/
https://www.ncbi.nlm.nih.gov/pubmed/35141372
http://dx.doi.org/10.1016/j.dib.2022.107860
work_keys_str_mv AT badenhorstjaco nchltauxiliaryspeechdataforasrtechnologydevelopmentinsouthafrica
AT dewetfebe nchltauxiliaryspeechdataforasrtechnologydevelopmentinsouthafrica