
Display-Semantic Transformer for Scene Text Recognition

Linguistic knowledge helps a lot in scene text recognition by providing semantic information to refine the character sequence. The visual model only focuses on the visual texture of characters without actively learning linguistic information, which leads to poor model recognition rates in some noisy...

Bibliographic Details
Main authors:	Yang, Xinqi; Silamu, Wushour; Xu, Miaomiao; Li, Yanbing
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10574938/ https://www.ncbi.nlm.nih.gov/pubmed/37836989 http://dx.doi.org/10.3390/s23198159

_version_	1785120806162923520
author	Yang, Xinqi Silamu, Wushour Xu, Miaomiao Li, Yanbing
author_facet	Yang, Xinqi Silamu, Wushour Xu, Miaomiao Li, Yanbing
author_sort	Yang, Xinqi
collection	PubMed
description	Linguistic knowledge helps a lot in scene text recognition by providing semantic information to refine the character sequence. The visual model only focuses on the visual texture of characters without actively learning linguistic information, which leads to poor model recognition rates in some noisy (distorted and blurry, etc.) images. In order to address the aforementioned issues, this study builds upon the most recent findings of the Vision Transformer, and our approach (called Display-Semantic Transformer, or DST for short) constructs a masked language model and a semantic visual interaction module. The model can mine deep semantic information from images to assist scene text recognition and improve the robustness of the model. The semantic visual interaction module can better realize the interaction between semantic information and visual features. In this way, the visual features can be enhanced by the semantic information so that the model can achieve a better recognition effect. The experimental results show that our model improves the average recognition accuracy on six benchmark test sets by nearly 2% compared to the baseline. Our model retains the benefits of having a small number of parameters and allows for fast inference speed. Additionally, it attains a more optimal balance between accuracy and speed.
format	Online Article Text
id	pubmed-10574938
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10574938 2023-10-14 Display-Semantic Transformer for Scene Text Recognition Yang, Xinqi Silamu, Wushour Xu, Miaomiao Li, Yanbing Sensors (Basel) Article Linguistic knowledge helps a lot in scene text recognition by providing semantic information to refine the character sequence. The visual model only focuses on the visual texture of characters without actively learning linguistic information, which leads to poor model recognition rates in some noisy (distorted and blurry, etc.) images. In order to address the aforementioned issues, this study builds upon the most recent findings of the Vision Transformer, and our approach (called Display-Semantic Transformer, or DST for short) constructs a masked language model and a semantic visual interaction module. The model can mine deep semantic information from images to assist scene text recognition and improve the robustness of the model. The semantic visual interaction module can better realize the interaction between semantic information and visual features. In this way, the visual features can be enhanced by the semantic information so that the model can achieve a better recognition effect. The experimental results show that our model improves the average recognition accuracy on six benchmark test sets by nearly 2% compared to the baseline. Our model retains the benefits of having a small number of parameters and allows for fast inference speed. Additionally, it attains a more optimal balance between accuracy and speed. MDPI 2023-09-28 /pmc/articles/PMC10574938/ /pubmed/37836989 http://dx.doi.org/10.3390/s23198159 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Yang, Xinqi Silamu, Wushour Xu, Miaomiao Li, Yanbing Display-Semantic Transformer for Scene Text Recognition
title	Display-Semantic Transformer for Scene Text Recognition
title_sort	display-semantic transformer for scene text recognition
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10574938/ https://www.ncbi.nlm.nih.gov/pubmed/37836989 http://dx.doi.org/10.3390/s23198159

Similar items