Vision–Language Model for Visual Question Answering in Medical Imagery

In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering (VQA) system can improve diagnosis by answering clinical questions presented with a medical image. Despite its enormous potential in the healthcare industry and services, this technology is still in its infancy and far from practical use. This paper introduces an approach based on a transformer encoder–decoder architecture. Specifically, we extract image features using the vision transformer (ViT) model, and we embed the question using a textual encoder transformer. We then concatenate the resulting visual and textual representations and feed them into a multi-modal decoder that generates the answer autoregressively. In the experiments, we validate the proposed model on two medical VQA datasets, VQA-RAD (radiology images) and PathVQA (pathology images). The model shows promising results compared to existing solutions, yielding closed and open accuracies of 84.99% and 72.97%, respectively, on VQA-RAD, and 83.86% and 62.37%, respectively, on PathVQA. Other metrics, such as the BLEU score, which measures the alignment between predicted and reference answer sentences, are also reported.
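The pipeline in the abstract (ViT patch features, a transformer question encoder, concatenation of the two representations, and an autoregressive multi-modal decoder) can be sketched in a few lines of PyTorch. The sketch below is illustrative only, not the authors' implementation: the class name MedicalVQA, the layer counts and sizes, the 16x16 patching of 224x224 inputs, and the vocabulary size are all assumptions, and positional embeddings and pretrained encoder weights are omitted for brevity.

import torch
import torch.nn as nn


class MedicalVQA(nn.Module):
    """Encoder-decoder VQA sketch: a ViT-style image encoder and a text
    encoder feed a concatenated memory to an autoregressive decoder."""

    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Image path: cut the image into 16x16 patches and embed each one
        # (positional embeddings omitted for brevity).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Question path: token embedding followed by a transformer encoder.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Multi-modal decoder: cross-attends over the concatenated
        # visual + textual memory while predicting answer tokens.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, question_ids, answer_ids):
        # (B, 3, 224, 224) -> (B, 196, d_model) patch sequence.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual = self.image_encoder(patches)
        textual = self.text_encoder(self.token_embed(question_ids))
        # Concatenate both modalities into a single memory sequence.
        memory = torch.cat([visual, textual], dim=1)
        # Teacher-forced decoding with a causal mask (autoregressive).
        tgt = self.token_embed(answer_ids)
        t = tgt.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # (B, T, vocab) next-token logits


# Shape check with random inputs (hypothetical vocabulary and token ids).
model = MedicalVQA()
logits = model(
    torch.randn(2, 3, 224, 224),       # batch of medical images
    torch.randint(0, 30522, (2, 16)),  # tokenized questions
    torch.randint(0, 30522, (2, 8)),   # teacher-forced answer tokens
)
print(logits.shape)  # torch.Size([2, 8, 30522])

A real system would likely initialize both encoders from pretrained checkpoints and decode with greedy or beam search at inference time rather than teacher forcing.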

Bibliographic Details
Main Authors: Bazi, Yakoub; Rahhal, Mohamad Mahmoud Al; Bashmal, Laila; Zuair, Mansour
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10045796/
https://www.ncbi.nlm.nih.gov/pubmed/36978771
http://dx.doi.org/10.3390/bioengineering10030380
Collection: PubMed
Record ID: pubmed-10045796
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Bioengineering (Basel)
Published Online: 2023-03-20
License: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).