
MiBio: A dataset for OCR post-processing evaluation

We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated er...


Bibliographic Details
Main authors:	Mei, Jie, Islam, Aminul, Moh’d, Abidalrahman, Wu, Yajing, Milios, Evangelos E.
Format:	Online Article Text
Language:	English
Published:	Elsevier 2018
Subjects:	Computer Science
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/ https://www.ncbi.nlm.nih.gov/pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099

_version_	1783364826482868224
author	Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E.
author_facet	Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E.
author_sort	Mei, Jie
collection	PubMed
description	We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis.
format	Online Article Text
id	pubmed-6197712
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-6197712 2018-10-24 MiBio: A dataset for OCR post-processing evaluation Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. Data Brief Computer Science We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis. Elsevier 2018-09-15 /pmc/articles/PMC6197712/ /pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099 Text en © 2018 Published by Elsevier Inc. http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Computer Science Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. MiBio: A dataset for OCR post-processing evaluation
title	MiBio: A dataset for OCR post-processing evaluation
title_full	MiBio: A dataset for OCR post-processing evaluation
title_fullStr	MiBio: A dataset for OCR post-processing evaluation
title_full_unstemmed	MiBio: A dataset for OCR post-processing evaluation
title_short	MiBio: A dataset for OCR post-processing evaluation
title_sort	mibio: a dataset for ocr post-processing evaluation
topic	Computer Science
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/ https://www.ncbi.nlm.nih.gov/pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099
work_keys_str_mv	AT meijie mibioadatasetforocrpostprocessingevaluation AT islamaminul mibioadatasetforocrpostprocessingevaluation AT mohdabidalrahman mibioadatasetforocrpostprocessingevaluation AT wuyajing mibioadatasetforocrpostprocessingevaluation AT miliosevangelose mibioadatasetforocrpostprocessingevaluation
