
Automatic Truecasing of Video Subtitles Using BERT: A Multilingual Adaptable Approach

This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles, produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data...


Bibliographic Details
Main authors: Rei, Ricardo; Guerreiro, Nuno Miguel; Batista, Fernando
Format: Online Article Text
Language: English
Published: 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274347/
http://dx.doi.org/10.1007/978-3-030-50146-4_52
_version_ 1783542561982382080
author Rei, Ricardo
Guerreiro, Nuno Miguel
Batista, Fernando
collection PubMed
description This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles, produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament, and on video subtitles, revealing that the proposed approach is suitable for performing capitalization, not only in each one of the domains, but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved both for monolingual and multilingual data. Finally, we applied domain adaptation by finetuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in terms of computational cost.
format Online
Article
Text
id pubmed-7274347
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7274347 2020-06-05
Automatic Truecasing of Video Subtitles Using BERT: A Multilingual Adaptable Approach
Rei, Ricardo; Guerreiro, Nuno Miguel; Batista, Fernando
Information Processing and Management of Uncertainty in Knowledge-Based Systems
Article
2020-05-18
/pmc/articles/PMC7274347/
http://dx.doi.org/10.1007/978-3-030-50146-4_52
Text en
© Springer Nature Switzerland AG 2020
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Automatic Truecasing of Video Subtitles Using BERT: A Multilingual Adaptable Approach
title_sort automatic truecasing of video subtitles using bert: a multilingual adaptable approach
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274347/
http://dx.doi.org/10.1007/978-3-030-50146-4_52