
Layout-aware text extraction from full-text PDF of scientific articles

BACKGROUND: The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocurati...


Bibliographic Details
Main authors: Ramakrishnan, Cartic; Patnia, Abhishek; Hovy, Eduard; Burns, Gully APC
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441580/
https://www.ncbi.nlm.nih.gov/pubmed/22640904
http://dx.doi.org/10.1186/1751-0473-7-7
_version_ 1782243323923136512
author Ramakrishnan, Cartic
Patnia, Abhishek
Hovy, Eduard
Burns, Gully APC
author_facet Ramakrishnan, Cartic
Patnia, Abhishek
Hovy, Eduard
Burns, Gully APC
author_sort Ramakrishnan, Cartic
collection PubMed
description BACKGROUND: The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. RESULTS: Our paper describes the construction and performance of an open-source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96%, Recall = 0.89% and F1 = 0.91%. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement. CONCLUSIONS: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
format Online
Article
Text
id pubmed-3441580
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3441580 2012-09-18 Layout-aware text extraction from full-text PDF of scientific articles Ramakrishnan, Cartic Patnia, Abhishek Hovy, Eduard Burns, Gully APC Source Code Biol Med Software Review BioMed Central 2012-05-28 /pmc/articles/PMC3441580/ /pubmed/22640904 http://dx.doi.org/10.1186/1751-0473-7-7 Text en Copyright ©2012 Ramakrishnan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software Review
Ramakrishnan, Cartic
Patnia, Abhishek
Hovy, Eduard
Burns, Gully APC
Layout-aware text extraction from full-text PDF of scientific articles
title Layout-aware text extraction from full-text PDF of scientific articles
title_full Layout-aware text extraction from full-text PDF of scientific articles
title_fullStr Layout-aware text extraction from full-text PDF of scientific articles
title_full_unstemmed Layout-aware text extraction from full-text PDF of scientific articles
title_short Layout-aware text extraction from full-text PDF of scientific articles
title_sort layout-aware text extraction from full-text pdf of scientific articles
topic Software Review
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441580/
https://www.ncbi.nlm.nih.gov/pubmed/22640904
http://dx.doi.org/10.1186/1751-0473-7-7
work_keys_str_mv AT ramakrishnancartic layoutawaretextextractionfromfulltextpdfofscientificarticles
AT patniaabhishek layoutawaretextextractionfromfulltextpdfofscientificarticles
AT hovyeduard layoutawaretextextractionfromfulltextpdfofscientificarticles
AT burnsgullyapc layoutawaretextextractionfromfulltextpdfofscientificarticles
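
Illustrative note: the abstract in this record describes a three-stage process (detect contiguous text blocks from spatial layout, classify blocks with rules, stitch classified blocks in reading order). The Python sketch below is only a toy illustration of that idea and is not the LA-PDFText implementation. Every name in it (Line, Block, detect_blocks, classify_blocks, stitch, the max_gap and body_font thresholds, the heading list) is a hypothetical assumption made for the example, and the input is hand-written line data rather than a parsed PDF.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    page: int
    column: int
    top: float        # y of the line's top edge, increasing downward
    height: float
    font_size: float
    text: str


@dataclass
class Block:
    page: int
    column: int
    top: float
    bottom: float
    font_size: float
    text: str
    label: str = "unclassified"


def detect_blocks(lines: List[Line], max_gap: float = 4.0) -> List[Block]:
    """Stage 1 (toy): merge lines into a block when they share page, column
    and font size and the vertical gap is at most max_gap (assumed value)."""
    blocks: List[Block] = []
    # Sorting by page, column, top gives a simple multi-column reading order.
    for ln in sorted(lines, key=lambda l: (l.page, l.column, l.top)):
        prev = blocks[-1] if blocks else None
        if (prev is not None
                and prev.page == ln.page
                and prev.column == ln.column
                and prev.font_size == ln.font_size
                and ln.top - prev.bottom <= max_gap):
            prev.text += " " + ln.text
            prev.bottom = ln.top + ln.height
        else:
            blocks.append(Block(ln.page, ln.column, ln.top,
                                ln.top + ln.height, ln.font_size, ln.text))
    return blocks


# Hypothetical heading vocabulary; a real rule set would be far richer.
SECTION_HEADINGS = {"abstract", "introduction", "methods",
                    "results", "discussion", "conclusions"}


def classify_blocks(blocks: List[Block], body_font: float = 10.0) -> List[Block]:
    """Stage 2 (toy rules): a block in a larger font whose text is a known
    section name is a heading; later blocks inherit that section label."""
    current = "front-matter"
    for b in blocks:
        name = b.text.strip().lower()
        if b.font_size > body_font and name in SECTION_HEADINGS:
            current = name
            b.label = "heading"
        else:
            b.label = current
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3 (toy): join non-heading blocks per section, in reading order."""
    sections: Dict[str, List[str]] = {}
    for b in blocks:
        if b.label != "heading":
            sections.setdefault(b.label, []).append(b.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


# Hand-written two-column page used only to exercise the sketch.
demo_lines = [
    Line(1, 0, 0.0, 12.0, 12.0, "Introduction"),
    Line(1, 0, 16.0, 10.0, 10.0, "PDF is common."),
    Line(1, 0, 27.0, 10.0, 10.0, "Text mining needs it."),
    Line(1, 1, 0.0, 10.0, 10.0, "Layout matters."),
    Line(1, 1, 20.0, 12.0, 12.0, "Methods"),
    Line(1, 1, 36.0, 10.0, 10.0, "We detect blocks."),
]
demo_blocks = classify_blocks(detect_blocks(demo_lines))
demo_sections = stitch(demo_blocks)
print(demo_sections)
```

In this toy version, sorting by page, column and vertical position is what lets text that continues in the second column rejoin the section started in the first. Real layouts (figures, footnotes, full-width headers) would need rules that this sketch does not attempt.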