
MiBio: A dataset for OCR post-processing evaluation

We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information into TSV files: 1) 2907 OCR-generated er...


Bibliographic Details
Main Authors: Mei, Jie, Islam, Aminul, Moh’d, Abidalrahman, Wu, Yajing, Milios, Evangelos E.
Format: Online Article Text
Language: English
Published: Elsevier 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/
https://www.ncbi.nlm.nih.gov/pubmed/30364639
http://dx.doi.org/10.1016/j.dib.2018.08.099
_version_ 1783364826482868224
author Mei, Jie
Islam, Aminul
Moh’d, Abidalrahman
Wu, Yajing
Milios, Evangelos E.
author_facet Mei, Jie
Islam, Aminul
Moh’d, Abidalrahman
Wu, Yajing
Milios, Evangelos E.
author_sort Mei, Jie
collection PubMed
description We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information into TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, and 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis.
format Online
Article
Text
id pubmed-6197712
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-61977122018-10-24 MiBio: A dataset for OCR post-processing evaluation Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. Data Brief Computer Science We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information into TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, and 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis. Elsevier 2018-09-15 /pmc/articles/PMC6197712/ /pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099 Text en © 2018 Published by Elsevier Inc. http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Computer Science
Mei, Jie
Islam, Aminul
Moh’d, Abidalrahman
Wu, Yajing
Milios, Evangelos E.
MiBio: A dataset for OCR post-processing evaluation
title MiBio: A dataset for OCR post-processing evaluation
title_full MiBio: A dataset for OCR post-processing evaluation
title_fullStr MiBio: A dataset for OCR post-processing evaluation
title_full_unstemmed MiBio: A dataset for OCR post-processing evaluation
title_short MiBio: A dataset for OCR post-processing evaluation
title_sort mibio: a dataset for ocr post-processing evaluation
topic Computer Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/
https://www.ncbi.nlm.nih.gov/pubmed/30364639
http://dx.doi.org/10.1016/j.dib.2018.08.099
work_keys_str_mv AT meijie mibioadatasetforocrpostprocessingevaluation
AT islamaminul mibioadatasetforocrpostprocessingevaluation
AT mohdabidalrahman mibioadatasetforocrpostprocessingevaluation
AT wuyajing mibioadatasetforocrpostprocessingevaluation
AT miliosevangelose mibioadatasetforocrpostprocessingevaluation