Semantic Coherence Dataset: Speech transcripts

The Semantic Coherence Dataset has been designed to experiment with semantic coherence metrics. More specifically, the dataset has been built to test whether probabilistic measures, such as perplexity, provide stable scores for analyzing spoken language. Perplexity, which was originally...

Bibliographic Details
Main authors: Colla, Davide; Delsanto, Matteo; Radicioni, Daniele P.
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761592/
https://www.ncbi.nlm.nih.gov/pubmed/36544569
http://dx.doi.org/10.1016/j.dib.2022.108799
_version_ 1784852710200180736
author Colla, Davide
Delsanto, Matteo
Radicioni, Daniele P.
author_facet Colla, Davide
Delsanto, Matteo
Radicioni, Daniele P.
author_sort Colla, Davide
collection PubMed
description The Semantic Coherence Dataset was designed for experimenting with semantic coherence metrics. More specifically, the dataset was built to test whether probabilistic measures, such as perplexity, provide stable scores for analyzing spoken language. Perplexity, which was originally conceived as an information-theoretic measure to assess the probabilistic inference properties of language models, has recently been shown to be an appropriate tool for categorizing speech transcripts based on semantic coherence accounts. In particular, perplexity has been successfully employed to discriminate between subjects suffering from Alzheimer's disease and healthy controls. The collected data consist of speech transcripts intended for investigating semantic coherence at different levels; they are arranged into two classes, one for intra-subject semantic coherence and one for inter-subject semantic coherence. In the former case, transcripts from a single speaker can be used to train and test language models and to explore whether the perplexity metric provides stable scores in assessing talks from that speaker, while still making it possible to distinguish between two different forms of speech: political rallies and interviews. In the latter case, models can be trained on transcripts from a given speaker and then used to measure how stable the perplexity metric is when computed with that speaker's model on transcripts from different speakers. Transcripts were extracted from talks lasting almost 13 hours (12:45:17 overall, 120,326 tokens) for the former class and almost 30 hours (29:47:34, 252,270 tokens) for the latter. The data can be reused to perform analyses of measures built on top of language models and, more generally, of measures aimed at exploring the linguistic features of text documents.
format Online
Article
Text
id pubmed-9761592
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9761592 2022-12-20 Semantic Coherence Dataset: Speech transcripts Colla, Davide Delsanto, Matteo Radicioni, Daniele P. Data Brief Data Article Elsevier 2022-12-02 /pmc/articles/PMC9761592/ /pubmed/36544569 http://dx.doi.org/10.1016/j.dib.2022.108799 Text en © 2022 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Data Article
Colla, Davide
Delsanto, Matteo
Radicioni, Daniele P.
Semantic Coherence Dataset: Speech transcripts
title Semantic Coherence Dataset: Speech transcripts
title_full Semantic Coherence Dataset: Speech transcripts
title_fullStr Semantic Coherence Dataset: Speech transcripts
title_full_unstemmed Semantic Coherence Dataset: Speech transcripts
title_short Semantic Coherence Dataset: Speech transcripts
title_sort semantic coherence dataset: speech transcripts
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9761592/
https://www.ncbi.nlm.nih.gov/pubmed/36544569
http://dx.doi.org/10.1016/j.dib.2022.108799
work_keys_str_mv AT colladavide semanticcoherencedatasetspeechtranscripts
AT delsantomatteo semanticcoherencedatasetspeechtranscripts
AT radicionidanielep semanticcoherencedatasetspeechtranscripts
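
The description above mentions training a language model on one speaker's transcripts and then scoring other transcripts by perplexity. The record does not say which kind of language model is used, so the following is only a minimal sketch under stated assumptions: a bigram model with add-one smoothing, whitespace tokenization, and invented toy strings standing in for transcripts. The names `BigramModel`, `tokenize`, and the sample texts are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the dataset's own preprocessing)."""
    return text.lower().split()


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, transcripts):
        self.contexts = Counter()  # how often each token occurs as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair occurs
        self.vocab = {"</s>", "<unk>"}
        for text in transcripts:
            tokens = ["<s>"] + tokenize(text) + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.contexts[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def prob(self, prev, cur):
        """Smoothed conditional probability P(cur | prev)."""
        return (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + len(self.vocab))

    def perplexity(self, text):
        """exp of the average negative log-probability per predicted token."""
        words = [t if t in self.vocab else "<unk>" for t in tokenize(text)]
        tokens = ["<s>"] + words + ["</s>"]
        log_prob = sum(math.log(self.prob(p, c)) for p, c in zip(tokens, tokens[1:]))
        return math.exp(-log_prob / (len(tokens) - 1))


# Hypothetical toy data: two training "talks" from speaker A,
# one held-out talk from A, and one talk from a different speaker B.
speaker_a_train = [
    "we will win and we will build the future",
    "we will build and we will win together",
]
speaker_a_heldout = "we will win together"
speaker_b_talk = "my research concerns protein folding dynamics"

model_a = BigramModel(speaker_a_train)

# Intra-subject setting: model and test transcript come from the same speaker.
print("intra-subject perplexity:", model_a.perplexity(speaker_a_heldout))

# Inter-subject setting: speaker A's model scores another speaker's transcript.
print("inter-subject perplexity:", model_a.perplexity(speaker_b_talk))
```

Lower perplexity means the model finds the transcript more predictable. In this sketch the intra-subject and inter-subject settings differ only in which transcript is passed to `perplexity`; any stability analysis would repeat that call over many transcripts and compare the resulting scores.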