
A framework for biomedical figure segmentation towards image-based document retrieval

The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification an...


Bibliographic Details
Main Authors:	Lopez, Luis D; Yu, Jingyi; Arighi, Cecilia; Tudor, Catalina O; Torii, Manabu; Huang, Hongzhan; Vijay-Shanker, K; Wu, Cathy
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856606/ https://www.ncbi.nlm.nih.gov/pubmed/24565394 http://dx.doi.org/10.1186/1752-0509-7-S4-S8

author	Lopez, Luis D; Yu, Jingyi; Arighi, Cecilia; Tudor, Catalina O; Torii, Manabu; Huang, Hongzhan; Vijay-Shanker, K; Wu, Cathy
collection	PubMed
description	The figures included in many biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information extracted from figures into classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., DNA sequence and protein sequence images) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems, among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries.
format	Online Article Text
id	pubmed-3856606
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3856606 2013-12-16 A framework for biomedical figure segmentation towards image-based document retrieval Lopez, Luis D Yu, Jingyi Arighi, Cecilia Tudor, Catalina O Torii, Manabu Huang, Hongzhan Vijay-Shanker, K Wu, Cathy BMC Syst Biol Research BioMed Central 2013-10-23 /pmc/articles/PMC3856606/ /pubmed/24565394 http://dx.doi.org/10.1186/1752-0509-7-S4-S8 Text en Copyright © 2013 Lopez et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A framework for biomedical figure segmentation towards image-based document retrieval
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856606/ https://www.ncbi.nlm.nih.gov/pubmed/24565394 http://dx.doi.org/10.1186/1752-0509-7-S4-S8


Similar Items