
Urdu text in natural scene images: a new dataset and preliminary text detection

Text detection in natural scene images for content analysis is an interesting task. The research community has seen some great developments for English/Mandarin text detection. However, Urdu text extraction in natural scene images is a task not well addressed. In this work, firstly, a new dataset is...


Bibliographic Details
Main authors: Ali, Hazrat; Iqbal, Khalid; Mujtaba, Ghulam; Fayyaz, Ahmad; Bulbul, Mohammad Farhad; Karam, Fazal Wahab; Zahir, Ali
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459794/
https://www.ncbi.nlm.nih.gov/pubmed/34616893
http://dx.doi.org/10.7717/peerj-cs.717
_version_ 1784571601901060096
author Ali, Hazrat
Iqbal, Khalid
Mujtaba, Ghulam
Fayyaz, Ahmad
Bulbul, Mohammad Farhad
Karam, Fazal Wahab
Zahir, Ali
collection PubMed
description Text detection in natural scene images for content analysis is an interesting task. The research community has seen great progress in English and Mandarin text detection. However, Urdu text extraction in natural scene images has not been well addressed. This work first introduces a new dataset of Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Second, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is then applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine (SVM) classifier is trained to discard non-text candidate regions. The remaining text candidate regions are then linked using centroid-based vertical and horizontal distances. The resulting text lines are further analyzed by a separate classifier based on histogram of oriented gradients (HOG) features to remove non-text regions. Extensive experiments are performed on the locally developed dataset to evaluate performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language; it provides a dataset for free research use and a baseline for the task of Urdu text extraction.
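The first-stage geometric filter and the centroid-based linking step named in the description can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Box` type, the function names `geometric_filter` and `link_by_centroid`, and every threshold (`min_area`, `max_aspect`, `max_dx`, `max_dy`) are hypothetical and chosen only for illustration. The candidate boxes are assumed to come from a region detector such as MSER, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one candidate region, in pixels."""
    x: int
    y: int
    w: int
    h: int

    @property
    def centroid(self):
        return (self.x + self.w / 2.0, self.y + self.h / 2.0)


def geometric_filter(boxes, min_area=30, max_aspect=8.0):
    """Drop regions that are too small or too elongated to be text.

    Both thresholds are illustrative placeholders, not values from the paper.
    """
    kept = []
    for b in boxes:
        if b.w <= 0 or b.h <= 0:
            continue
        aspect = max(b.w / b.h, b.h / b.w)
        if b.w * b.h >= min_area and aspect <= max_aspect:
            kept.append(b)
    return kept


def link_by_centroid(boxes, max_dx=40.0, max_dy=10.0):
    """Merge boxes into text lines when their centroids are close.

    Two boxes are linked if the horizontal and vertical distances between
    their centroids are within max_dx and max_dy (assumed values). Linked
    boxes are grouped with a small union-find structure.
    """
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xi, yi), (xj, yj) = boxes[i].centroid, boxes[j].centroid
            if abs(xi - xj) <= max_dx and abs(yi - yj) <= max_dy:
                parent[find(i)] = find(j)

    groups = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)

    # Each group becomes one line box enclosing all of its members.
    lines = []
    for members in groups.values():
        x0 = min(b.x for b in members)
        y0 = min(b.y for b in members)
        x1 = max(b.x + b.w for b in members)
        y1 = max(b.y + b.h for b in members)
        lines.append(Box(x0, y0, x1 - x0, y1 - y0))
    return sorted(lines, key=lambda b: (b.y, b.x))
```

In this sketch, linking is transitive: a chain of neighbouring regions forms one line even when its two ends are farther apart than `max_dx`. The SVM and HOG classification stages would operate on the outputs of these two functions and are omitted.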
format Online
Article
Text
id pubmed-8459794
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8459794 2021-10-05 Urdu text in natural scene images: a new dataset and preliminary text detection Ali, Hazrat Iqbal, Khalid Mujtaba, Ghulam Fayyaz, Ahmad Bulbul, Mohammad Farhad Karam, Fazal Wahab Zahir, Ali PeerJ Comput Sci Computer Vision PeerJ Inc. 2021-09-16 /pmc/articles/PMC8459794/ /pubmed/34616893 http://dx.doi.org/10.7717/peerj-cs.717 Text en © 2021 Ali et al.
https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Urdu text in natural scene images: a new dataset and preliminary text detection
topic Computer Vision