
Figure and caption extraction from biomedical documents

MOTIVATION: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective captions as a source of knowledge. Notably, an essential step underlying such mining is the extraction...


Bibliographic Details
Main Authors:	Li, Pengyuan, Jiang, Xiangying, Shatkay, Hagit
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2019
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821181/ https://www.ncbi.nlm.nih.gov/pubmed/30949681 http://dx.doi.org/10.1093/bioinformatics/btz228

_version_	1783464100477534208
author	Li, Pengyuan Jiang, Xiangying Shatkay, Hagit
author_facet	Li, Pengyuan Jiang, Xiangying Shatkay, Hagit
author_sort	Li, Pengyuan
collection	PubMed
description	MOTIVATION: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective captions as a source of knowledge. Notably, an essential step underlying such mining is the extraction of figures and captions from publications. While several PDF parsing tools that extract information from such documents are publicly available, they attempt to identify images by analyzing the PDF encoding and structure and the complex graphical objects embedded within. As such, they often incorrectly identify figures and captions in scientific publications, whose structure is often non-trivial. The extraction of figures, captions and figure-caption pairs from biomedical publications is thus neither well-studied nor yet well-addressed. RESULTS: We introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike existing methods, we first separate between text and graphical contents, and then utilize layout information to effectively detect and extract figures and captions. We generate files containing the figures and their associated captions and provide those as output to the end-user. We test our system both over a public dataset of computer science documents previously used by others, and over two newly collected sets of publications focusing on the biomedical domain. Our experiments and results comparing PDFigCapX to other state-of-the-art systems show a significant improvement in performance, and demonstrate the effectiveness and robustness of our approach. AVAILABILITY AND IMPLEMENTATION: Our system is publicly available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX. The two new datasets are available at: https://www.eecis.udel.edu/~compbio/PDFigCapX/Downloads
format	Online Article Text
id	pubmed-6821181
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6821181 2019-11-04 Figure and caption extraction from biomedical documents Li, Pengyuan Jiang, Xiangying Shatkay, Hagit Bioinformatics Original Papers Oxford University Press 2019-11-01 2019-04-05 /pmc/articles/PMC6821181/ /pubmed/30949681 http://dx.doi.org/10.1093/bioinformatics/btz228 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Original Papers Li, Pengyuan Jiang, Xiangying Shatkay, Hagit Figure and caption extraction from biomedical documents
title	Figure and caption extraction from biomedical documents
title_full	Figure and caption extraction from biomedical documents
title_fullStr	Figure and caption extraction from biomedical documents
title_full_unstemmed	Figure and caption extraction from biomedical documents
title_short	Figure and caption extraction from biomedical documents
title_sort	figure and caption extraction from biomedical documents
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821181/ https://www.ncbi.nlm.nih.gov/pubmed/30949681 http://dx.doi.org/10.1093/bioinformatics/btz228
work_keys_str_mv	AT lipengyuan figureandcaptionextractionfrombiomedicaldocuments AT jiangxiangying figureandcaptionextractionfrombiomedicaldocuments AT shatkayhagit figureandcaptionextractionfrombiomedicaldocuments
