
Integrating image caption information into biomedical document classification in support of biocuration

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to t...


Bibliographic Details
Main authors:	Jiang, Xiangying; Li, Pengyuan; Kadin, James; Blake, Judith A; Ringwald, Martin; Shatkay, Hagit
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7159034/ https://www.ncbi.nlm.nih.gov/pubmed/32294192 http://dx.doi.org/10.1093/database/baaa024

author	Jiang, Xiangying; Li, Pengyuan; Kadin, James; Blake, Judith A; Ringwald, Martin; Shatkay, Hagit
collection	PubMed
description	Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
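The reported scores can be related to one another with the standard definitions. The following is a minimal sketch, assuming the f-measure in the abstract is the balanced F1 (harmonic mean of precision and recall) and that the Matthews correlation coefficient follows its usual confusion-matrix formula. The function names are illustrative and are not taken from the authors' code; the abstract does not give confusion-matrix counts, so none are assumed here.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (undefined case),
    which is a common convention and an assumption of this sketch.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Precision and recall as reported in the abstract.
f1 = f_measure(0.698, 0.784)
# The result agrees with the reported f-measure (0.738) to within 0.001;
# small differences are expected because the inputs are already rounded.
print(f"F-measure from reported precision/recall: {f1:.4f}")
```

Because the MCC uses all four confusion-matrix cells, including true negatives, it is often preferred over accuracy on imbalanced data such as the roughly 1:10 relevant-to-irrelevant split described above.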
format	Online Article Text
id	pubmed-7159034
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7159034 2020-04-20 Database (Oxford) Oxford University Press 2020-04-15 /pmc/articles/PMC7159034/ /pubmed/32294192 http://dx.doi.org/10.1093/database/baaa024 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Integrating image caption information into biomedical document classification in support of biocuration
