ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models

The ChatSubs dataset [5] contains dialogue data in Spanish and three of Spain's co-official languages (Catalan, Basque, and Galician). It has been obtained from OpenSubtitles, from which we have gathered the movie subtitles in our languages of interest and processed them to generate clearly seg...

Bibliographic Details
Main authors:	Kharitonova, Ksenia; Callejas, Zoraida; Pérez-Fernández, David; Gutiérrez-Fandiño, Asier; Griol, David
Format:	Online Article Text
Language:	English
Published:	Elsevier 2023
Subjects:	Data Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10519822/ https://www.ncbi.nlm.nih.gov/pubmed/37767123 http://dx.doi.org/10.1016/j.dib.2023.109565

collection	PubMed
description	The ChatSubs dataset [5] contains dialogue data in Spanish and three of Spain's co-official languages (Catalan, Basque, and Galician). It was obtained from OpenSubtitles: we gathered the movie subtitles in our languages of interest and processed them to generate clearly segmented dialogues and their turns. The data processing code is publicly accessible. The result is 206,706 JSON files with more than 20 million dialogues and 96 million turns, which makes it one of the largest dialogue corpora available, as similar datasets in better-resourced languages either do not reach 500k dialogues or contain less clearly defined conversations. Thus, the ChatSubs dataset is an ideal resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.
format	Online Article Text
id	pubmed-10519822
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-10519822; 2023-09-27; ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models; Kharitonova, Ksenia; Callejas, Zoraida; Pérez-Fernández, David; Gutiérrez-Fandiño, Asier; Griol, David; Data Brief; Data Article; Elsevier; 2023-09-14; /pmc/articles/PMC10519822/; /pubmed/37767123; http://dx.doi.org/10.1016/j.dib.2023.109565; Text; en; © 2023 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title	ChatSubs: A dataset of dialogues in Spanish, Catalan, Basque and Galician extracted from movie subtitles for developing advanced conversational models
topic	Data Article

Similar items