
Display-Semantic Transformer for Scene Text Recognition

Linguistic knowledge aids scene text recognition by providing semantic information that refines the character sequence. A purely visual model focuses only on the visual texture of characters without actively learning linguistic information, which leads to poor recognition rates on some noisy...


Bibliographic Details
Main Authors: Yang, Xinqi; Silamu, Wushour; Xu, Miaomiao; Li, Yanbing
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10574938/
https://www.ncbi.nlm.nih.gov/pubmed/37836989
http://dx.doi.org/10.3390/s23198159
_version_ 1785120806162923520
author Yang, Xinqi
Silamu, Wushour
Xu, Miaomiao
Li, Yanbing
collection PubMed
description Linguistic knowledge aids scene text recognition by providing semantic information that refines the character sequence. A purely visual model focuses only on the visual texture of characters without actively learning linguistic information, which leads to poor recognition rates on noisy (for example, distorted or blurry) images. To address this issue, this study builds on recent findings on the Vision Transformer: our approach, the Display-Semantic Transformer (DST), constructs a masked language model and a semantic visual interaction module. The model can mine deep semantic information from images to assist scene text recognition and improve robustness. The semantic visual interaction module realizes the interaction between semantic information and visual features, so that the visual features are enhanced by the semantic information and the model achieves better recognition. Experimental results show that our model improves the average recognition accuracy on six benchmark test sets by nearly 2% compared to the baseline. The model retains a small number of parameters and fast inference speed, and attains a better balance between accuracy and speed.
format Online
Article
Text
id pubmed-10574938
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10574938 2023-10-14 Display-Semantic Transformer for Scene Text Recognition Yang, Xinqi Silamu, Wushour Xu, Miaomiao Li, Yanbing Sensors (Basel) Article MDPI 2023-09-28 /pmc/articles/PMC10574938/ /pubmed/37836989 http://dx.doi.org/10.3390/s23198159 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Display-Semantic Transformer for Scene Text Recognition
topic Article