Assessing the Impact of OCR Errors in Information Retrieval

A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is ne...

Bibliographic Details
Main Authors: Bazzo, Guilherme Torresan; Lorentz, Gustavo Acauan; Suarez Vargas, Danny; Moreira, Viviane P.
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148068/
http://dx.doi.org/10.1007/978-3-030-45442-5_13
_version_ 1783520524339511296
author Bazzo, Guilherme Torresan
Lorentz, Gustavo Acauan
Suarez Vargas, Danny
Moreira, Viviane P.
author_facet Bazzo, Guilherme Torresan
Lorentz, Gustavo Acauan
Suarez Vargas, Danny
Moreira, Viviane P.
author_sort Bazzo, Guilherme Torresan
collection PubMed
description A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. In order to quantify such impact, errors were systematically inserted at varying rates in an initially clean IR collection. Our results showed that significant impacts are noticed starting at a 5% error rate. Furthermore, stemming has proven to make systems more robust to errors.
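The description above mentions inserting errors at varying rates into an initially clean collection. The record does not say how the errors were generated, so the following is only a minimal sketch under stated assumptions: the error rate is taken to be the fraction of words that get corrupted, and each corrupted word receives one random character edit. The names `corrupt_word` and `insert_errors` are hypothetical and are not taken from the article.

```python
import random
import string


def corrupt_word(word, rng):
    """Apply one random character edit to a non-empty word.

    Assumption: a single substitution, deletion, or insertion stands in
    for an OCR error; real OCR confusions are not modelled here.
    """
    pos = rng.randrange(len(word))
    op = rng.choice(("substitute", "delete", "insert"))
    letter = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        # Re-draw until the letter differs, so the word really changes.
        while letter == word[pos]:
            letter = rng.choice(string.ascii_lowercase)
        return word[:pos] + letter + word[pos + 1:]
    if op == "delete" and len(word) > 1:
        return word[:pos] + word[pos + 1:]
    # Insertion (also the fallback for one-letter words).
    return word[:pos] + letter + word[pos:]


def insert_errors(text, error_rate, seed=0):
    """Corrupt a fixed fraction of the whitespace-separated words in text.

    error_rate is assumed to be a word-level rate between 0 and 1.
    A seeded generator keeps the corruption reproducible.
    """
    rng = random.Random(seed)
    words = text.split()
    n_errors = round(len(words) * error_rate)
    targets = set(rng.sample(range(len(words)), n_errors))
    return " ".join(
        corrupt_word(w, rng) if i in targets else w
        for i, w in enumerate(words)
    )


if __name__ == "__main__":
    clean = "these files are typically converted into plain text " * 4
    for rate in (0.0, 0.05, 0.25):
        print(rate, insert_errors(clean, rate, seed=42))
```

Running the same retrieval queries against copies of a collection corrupted at several rates, and comparing the scores with those obtained on the clean copy, would give the kind of comparison the description refers to.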
format Online
Article
Text
id pubmed-7148068
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-71480682020-04-13 Assessing the Impact of OCR Errors in Information Retrieval Bazzo, Guilherme Torresan Lorentz, Gustavo Acauan Suarez Vargas, Danny Moreira, Viviane P. Advances in Information Retrieval Article A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. In order to quantify such impact, errors were systematically inserted at varying rates in an initially clean IR collection. Our results showed that significant impacts are noticed starting at a 5% error rate. Furthermore, stemming has proven to make systems more robust to errors. 2020-03-24 /pmc/articles/PMC7148068/ http://dx.doi.org/10.1007/978-3-030-45442-5_13 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Bazzo, Guilherme Torresan
Lorentz, Gustavo Acauan
Suarez Vargas, Danny
Moreira, Viviane P.
Assessing the Impact of OCR Errors in Information Retrieval
title Assessing the Impact of OCR Errors in Information Retrieval
title_full Assessing the Impact of OCR Errors in Information Retrieval
title_fullStr Assessing the Impact of OCR Errors in Information Retrieval
title_full_unstemmed Assessing the Impact of OCR Errors in Information Retrieval
title_short Assessing the Impact of OCR Errors in Information Retrieval
title_sort assessing the impact of ocr errors in information retrieval
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148068/
http://dx.doi.org/10.1007/978-3-030-45442-5_13