MiBio: A dataset for OCR post-processing evaluation
We introduce a dataset for evaluating OCR post-processing models. The dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated er...
Main Authors: | Mei, Jie; Islam, Aminul; Moh’d, Abidalrahman; Wu, Yajing; Milios, Evangelos E. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2018 |
Subjects: | Computer Science |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/ https://www.ncbi.nlm.nih.gov/pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099 |
_version_ | 1783364826482868224 |
---|---|
author | Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. |
author_facet | Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. |
author_sort | Mei, Jie |
collection | PubMed |
description | We introduce a dataset for evaluating OCR post-processing models. The dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, and 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis. |
format | Online Article Text |
id | pubmed-6197712 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-6197712 2018-10-24 MiBio: A dataset for OCR post-processing evaluation Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. Data Brief Computer Science We introduce a dataset for evaluating OCR post-processing models. The dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information in TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR texts and its correction in the ground truth text, and 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis. Elsevier 2018-09-15 /pmc/articles/PMC6197712/ /pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099 Text en © 2018 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Computer Science Mei, Jie Islam, Aminul Moh’d, Abidalrahman Wu, Yajing Milios, Evangelos E. MiBio: A dataset for OCR post-processing evaluation |
title | MiBio: A dataset for OCR post-processing evaluation |
title_full | MiBio: A dataset for OCR post-processing evaluation |
title_fullStr | MiBio: A dataset for OCR post-processing evaluation |
title_full_unstemmed | MiBio: A dataset for OCR post-processing evaluation |
title_short | MiBio: A dataset for OCR post-processing evaluation |
title_sort | mibio: a dataset for ocr post-processing evaluation |
topic | Computer Science |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197712/ https://www.ncbi.nlm.nih.gov/pubmed/30364639 http://dx.doi.org/10.1016/j.dib.2018.08.099 |
work_keys_str_mv | AT meijie mibioadatasetforocrpostprocessingevaluation AT islamaminul mibioadatasetforocrpostprocessingevaluation AT mohdabidalrahman mibioadatasetforocrpostprocessingevaluation AT wuyajing mibioadatasetforocrpostprocessingevaluation AT miliosevangelose mibioadatasetforocrpostprocessingevaluation |