
A framework for biomedical figure segmentation towards image-based document retrieval

The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification an...

Bibliographic Details
Main Authors: Lopez, Luis D; Yu, Jingyi; Arighi, Cecilia; Tudor, Catalina O; Torii, Manabu; Huang, Hongzhan; Vijay-Shanker, K; Wu, Cathy
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856606/
https://www.ncbi.nlm.nih.gov/pubmed/24565394
http://dx.doi.org/10.1186/1752-0509-7-S4-S8
_version_ 1782295088291905536
author Lopez, Luis D
Yu, Jingyi
Arighi, Cecilia
Tudor, Catalina O
Torii, Manabu
Huang, Hongzhan
Vijay-Shanker, K
Wu, Cathy
author_facet Lopez, Luis D
Yu, Jingyi
Arighi, Cecilia
Tudor, Catalina O
Torii, Manabu
Huang, Hongzhan
Vijay-Shanker, K
Wu, Cathy
author_sort Lopez, Luis D
collection PubMed
description The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., DNA sequence and protein sequence images) and they differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking into consideration the document layout allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries.
format Online
Article
Text
id pubmed-3856606
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3856606 2013-12-16 A framework for biomedical figure segmentation towards image-based document retrieval Lopez, Luis D Yu, Jingyi Arighi, Cecilia Tudor, Catalina O Torii, Manabu Huang, Hongzhan Vijay-Shanker, K Wu, Cathy BMC Syst Biol Research The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., DNA sequence and protein sequence images) and they differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking into consideration the document layout allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries. BioMed Central 2013-10-23 /pmc/articles/PMC3856606/ /pubmed/24565394 http://dx.doi.org/10.1186/1752-0509-7-S4-S8 Text en Copyright © 2013 Lopez et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Lopez, Luis D
Yu, Jingyi
Arighi, Cecilia
Tudor, Catalina O
Torii, Manabu
Huang, Hongzhan
Vijay-Shanker, K
Wu, Cathy
A framework for biomedical figure segmentation towards image-based document retrieval
title A framework for biomedical figure segmentation towards image-based document retrieval
title_full A framework for biomedical figure segmentation towards image-based document retrieval
title_fullStr A framework for biomedical figure segmentation towards image-based document retrieval
title_full_unstemmed A framework for biomedical figure segmentation towards image-based document retrieval
title_short A framework for biomedical figure segmentation towards image-based document retrieval
title_sort framework for biomedical figure segmentation towards image-based document retrieval
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856606/
https://www.ncbi.nlm.nih.gov/pubmed/24565394
http://dx.doi.org/10.1186/1752-0509-7-S4-S8
work_keys_str_mv AT lopezluisd aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT yujingyi aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT arighicecilia aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT tudorcatalinao aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT toriimanabu aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT huanghongzhan aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT vijayshankerk aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT wucathy aframeworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT lopezluisd frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT yujingyi frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT arighicecilia frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT tudorcatalinao frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT toriimanabu frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT huanghongzhan frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT vijayshankerk frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval
AT wucathy frameworkforbiomedicalfiguresegmentationtowardsimagebaseddocumentretrieval