
Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition

Bibliographic Details
Main Authors: Buoy, Rina; Iwamura, Masakazu; Srun, Sovila; Kise, Koichi
Format: Online Article Text
Language: English
Published: MDPI, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10672533/
https://www.ncbi.nlm.nih.gov/pubmed/37998095
http://dx.doi.org/10.3390/jimaging9110248
author Buoy, Rina
Iwamura, Masakazu
Srun, Sovila
Kise, Koichi
collection PubMed
description Connectionist temporal classification (CTC) is a favored decoder in scene text recognition (STR) for its simplicity and efficiency. However, most CTC-based methods utilize one-dimensional (1D) vector sequences, usually derived from a recurrent neural network (RNN) encoder. This results in the absence of an explainable 2D spatial relationship between the predicted characters and the corresponding image regions, which is essential for model explainability. On the other hand, 2D attention-based methods enhance recognition accuracy and offer character location information via cross-attention mechanisms, linking predictions to image regions. However, these methods are more computationally intensive than 1D CTC-based methods. To achieve both low latency and model explainability via character localization using a 1D CTC decoder, we propose a marginalization-based method that processes 2D feature maps and predicts a sequence of 2D joint probability distributions over the height and class dimensions. Based on the proposed method, we introduce an association map that aids in character localization and in explaining model predictions. This map parallels the role of a cross-attention map, as seen in computationally intensive attention-based architectures. With the proposed method, we consider a ViT-CTC STR architecture that uses a 1D CTC decoder and a pretrained vision Transformer (ViT) as a 2D feature extractor. Our ViT-CTC models were trained on synthetic data and fine-tuned on real labeled datasets. These models outperform the recent state-of-the-art (SOTA) CTC-based methods on benchmarks in terms of recognition accuracy. Compared with the baseline Transformer-decoder-based models, our ViT-CTC models offer a speed boost of up to 12 times regardless of the backbone, with at most a 3.1% reduction in total word recognition accuracy. In addition, both qualitative and quantitative assessments of character locations estimated from the association map align closely with those from the cross-attention map and ground-truth character-level bounding boxes.
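The marginalization described above can be illustrated concretely. The following is a minimal NumPy sketch, not the authors' implementation: at each horizontal position (CTC time step) of the 2D feature map, a joint distribution over height and character class is predicted; summing over the height dimension yields the per-step class distribution consumed by a standard 1D CTC decoder, while the conditional distribution over height for the predicted class acts as the association map used for character localization. All shapes, variable names, and the softmax parameterization are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes (assumptions): feature-map height H, width W used as
# the CTC time axis, and C character classes including the CTC blank.
H, W, C = 8, 32, 37
rng = np.random.default_rng(0)
logits = rng.standard_normal((W, H, C))   # stand-in for the model's 2D output

# Joint distribution P(h, c | t): softmax over the flattened (H x C) grid
# at each time step t.
joint = softmax(logits.reshape(W, H * C)).reshape(W, H, C)

# Marginalize over height: P(c | t) = sum_h P(h, c | t).
# This is the 1D sequence a standard CTC decoder/loss consumes.
ctc_probs = joint.sum(axis=1)              # shape (W, C); each row sums to 1

# Association map: P(h | c*, t) for the most likely class c* at each step.
# It localizes the character vertically; the step index t gives the
# horizontal position, so together they point at an image region.
best_cls = ctc_probs.argmax(axis=1)        # shape (W,)
assoc = joint[np.arange(W), :, best_cls]   # shape (W, H)
assoc /= assoc.sum(axis=1, keepdims=True)
```

Under this reading, the decoder itself stays strictly 1D; height information survives only through the joint distribution, which is what allows character localization without a cross-attention mechanism.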
format Online Article Text
id pubmed-10672533
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10672533 2023-11-15 Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition. Buoy, Rina; Iwamura, Masakazu; Srun, Sovila; Kise, Koichi. J Imaging, Article. MDPI, 2023-11-15. /pmc/articles/PMC10672533/ /pubmed/37998095 http://dx.doi.org/10.3390/jimaging9110248 Text en. © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10672533/
https://www.ncbi.nlm.nih.gov/pubmed/37998095
http://dx.doi.org/10.3390/jimaging9110248