Utilizing image and caption information for biomedical document classification

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. Th...

Bibliographic Details
Main Authors: Li, Pengyuan; Jiang, Xiangying; Zhang, Gongbo; Trabucco, Juan Trelles; Raciti, Daniela; Smith, Cynthia; Ringwald, Martin; Marai, G Elisabeta; Arighi, Cecilia; Shatkay, Hagit
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: General Computational Biology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8346654/
https://www.ncbi.nlm.nih.gov/pubmed/34252939
http://dx.doi.org/10.1093/bioinformatics/btab331
author Li, Pengyuan
Jiang, Xiangying
Zhang, Gongbo
Trabucco, Juan Trelles
Raciti, Daniela
Smith, Cynthia
Ringwald, Martin
Marai, G Elisabeta
Arighi, Cecilia
Shatkay, Hagit
collection PubMed
description MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual content, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
RESULTS: We present a new document classification scheme that uses both image and caption information in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.
format Online
Article
Text
id pubmed-8346654
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8346654 2021-08-09 Bioinformatics General Computational Biology Oxford University Press 2021-07-12 /pmc/articles/PMC8346654/ /pubmed/34252939 http://dx.doi.org/10.1093/bioinformatics/btab331 Text en © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Utilizing image and caption information for biomedical document classification
topic General Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8346654/
https://www.ncbi.nlm.nih.gov/pubmed/34252939
http://dx.doi.org/10.1093/bioinformatics/btab331