Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts
Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for the patients but also for their families, relatives, and society in general. Despite the significance of this phenomenon and the importance of an early diagnosis, there are still limitations…
Main Authors: | Ilias, Loukas; Askounis, Dimitris |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A., 2022 |
Subjects: | Aging Neuroscience |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8969102/ https://www.ncbi.nlm.nih.gov/pubmed/35370608 http://dx.doi.org/10.3389/fnagi.2022.830943 |
_version_ | 1784679187539296256 |
---|---|
author | Ilias, Loukas Askounis, Dimitris |
author_facet | Ilias, Loukas Askounis, Dimitris |
author_sort | Ilias, Loukas |
collection | PubMed |
description | Alzheimer's dementia (AD) entails negative psychological, social, and economic consequences not only for the patients but also for their families, relatives, and society in general. Despite the significance of this phenomenon and the importance of an early diagnosis, there are still limitations. Specifically, the main limitation pertains to the way the modalities of speech and transcripts are combined in a single neural network. Existing research works add/concatenate the image and text representations, employ majority-voting approaches, or average the predictions after training many textual and speech models separately. To address these limitations, in this article we present new methods to detect AD patients and predict Mini-Mental State Examination (MMSE) scores in an end-to-end trainable manner, using a combination of BERT, Vision Transformer, Co-Attention, a Multimodal Shifting Gate, and a variant of the self-attention mechanism. Specifically, we convert audio to Log-Mel spectrograms, their delta, and delta-delta (acceleration values). First, we pass each transcript and image through a BERT model and a Vision Transformer, respectively, adding a co-attention layer on top, which generates image and word attention simultaneously. Second, we propose an architecture that integrates multimodal information into a BERT model via a Multimodal Shifting Gate. Finally, we introduce an approach to capture both the inter- and intra-modal interactions by concatenating the textual and visual representations and utilizing a self-attention mechanism that includes a gate model. Experiments conducted on the ADReSS Challenge dataset indicate that our introduced models offer valuable advantages over existing research initiatives, achieving competitive results in both the AD classification and MMSE regression tasks. Specifically, our best-performing model attains an accuracy of 90.00% in the AD classification task and a Root Mean Squared Error (RMSE) of 3.61 in the MMSE regression task, achieving a new state-of-the-art performance in the MMSE regression task. (Hedged code sketches of the audio preprocessing and the gated fusion step follow after this record.) |
format | Online Article Text |
id | pubmed-8969102 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8969102 2022-04-01 Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts Ilias, Loukas Askounis, Dimitris Front Aging Neurosci Aging Neuroscience Frontiers Media S.A. 2022-03-17 /pmc/articles/PMC8969102/ /pubmed/35370608 http://dx.doi.org/10.3389/fnagi.2022.830943 Text en Copyright © 2022 Ilias and Askounis. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Aging Neuroscience Ilias, Loukas Askounis, Dimitris Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title | Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title_full | Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title_fullStr | Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title_full_unstemmed | Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title_short | Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts |
title_sort | multimodal deep learning models for detecting dementia from speech and transcripts |
topic | Aging Neuroscience |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8969102/ https://www.ncbi.nlm.nih.gov/pubmed/35370608 http://dx.doi.org/10.3389/fnagi.2022.830943 |
work_keys_str_mv | AT iliasloukas multimodaldeeplearningmodelsfordetectingdementiafromspeechandtranscripts AT askounisdimitris multimodaldeeplearningmodelsfordetectingdementiafromspeechandtranscripts |
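The description above mentions two concrete technical steps that a short sketch can make more tangible. First, the audio preprocessing: each recording is converted to a Log-Mel spectrogram together with its delta and delta-delta (acceleration) features, which can then be stacked as a three-channel image for the Vision Transformer. Below is a minimal sketch using librosa; the sample rate, number of Mel bands, and the stacking into channels are illustrative assumptions, not the paper's exact settings.

```python
import librosa
import numpy as np

def audio_to_logmel_image(path, sr=16000, n_mels=128):
    """Turn a waveform into a 3-channel 'image': Log-Mel spectrogram,
    its delta, and its delta-delta (acceleration values).
    Parameter values are illustrative, not the paper's settings."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # Log-Mel spectrogram (dB)
    delta = librosa.feature.delta(log_mel)             # first-order differences
    delta2 = librosa.feature.delta(log_mel, order=2)   # acceleration values
    return np.stack([log_mel, delta, delta2], axis=0)  # shape: (3, n_mels, frames)
```

Second, the Multimodal Shifting Gate that injects visual information into the BERT hidden states. The record does not spell out its exact formulation, so the following PyTorch sketch shows one common form of such a gate: a sigmoid gate computed from both modalities scales a visual displacement that is added to the text representation. The class name, layer sizes, and norm-based scaling are assumptions for illustration, not the authors' exact method.

```python
import torch
import torch.nn as nn

class ShiftingGate(nn.Module):
    """Illustrative gated fusion of a text state with a visual state;
    a hedged sketch, not the paper's exact architecture."""

    def __init__(self, text_dim=768, vis_dim=768):
        super().__init__()
        self.gate = nn.Linear(text_dim + vis_dim, vis_dim)
        self.shift = nn.Linear(vis_dim, text_dim)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, h_text, h_vis):
        # Per-dimension gate: how much visual information to let through
        g = torch.sigmoid(self.gate(torch.cat([h_text, h_vis], dim=-1)))
        # Displacement of the textual representation toward the visual one
        d = self.shift(g * h_vis)
        # Keep the shift small relative to the norm of the text state
        alpha = torch.clamp(h_text.norm(dim=-1, keepdim=True)
                            / (d.norm(dim=-1, keepdim=True) + 1e-6), max=1.0)
        return self.norm(h_text + alpha * d)
```

Used on pooled representations, `ShiftingGate()(bert_cls, vit_cls)` would yield a text vector shifted toward the spectrogram-image modality before a classification or regression head.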