
Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning

Transformer-based approaches have shown good results in image captioning tasks. However, current approaches have a limitation in generating text from global features of an entire image. Therefore, we propose novel methods for generating better image captioning as follows: (1) The Global-Local Visual...


Bibliographic Details
Main authors:	Lee, Hojun; Cho, Hyunjun; Park, Jieun; Chae, Jinyeong; Kim, Jihie
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Perspective
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8874388/ https://www.ncbi.nlm.nih.gov/pubmed/35214330 http://dx.doi.org/10.3390/s22041429

_version_	1784657676478709760
author	Lee, Hojun; Cho, Hyunjun; Park, Jieun; Chae, Jinyeong; Kim, Jihie
author_sort	Lee, Hojun
collection	PubMed
description	Transformer-based approaches have shown good results in image captioning tasks. However, current approaches have a limitation in generating text from global features of an entire image. Therefore, we propose novel methods for generating better image captioning as follows: (1) The Global-Local Visual Extractor (GLVE) to capture both global features and local features. (2) The Cross Encoder-Decoder Transformer (CEDT) for injecting multiple-level encoder features into the decoding process. GLVE extracts not only global visual features that can be obtained from an entire image, such as size of organ or bone structure, but also local visual features that can be generated from a local region, such as lesion area. Given an image, CEDT can create a detailed description of the overall features by injecting both low-level and high-level encoder outputs into the decoder. Each method contributes to performance improvement and generates a description such as organ size and bone structure. The proposed model was evaluated on the IU X-ray dataset and achieved better performance than the transformer-based baseline results, by 5.6% in BLEU score, by 0.56% in METEOR, and by 1.98% in ROUGE-L.
format	Online Article Text
id	pubmed-8874388
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8874388 2022-02-26 Sensors (Basel) Perspective MDPI 2022-02-13 /pmc/articles/PMC8874388/ /pubmed/35214330 http://dx.doi.org/10.3390/s22041429 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
topic	Perspective
