
Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that d...

Bibliographic Details
Main authors: Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467553/
https://www.ncbi.nlm.nih.gov/pubmed/28365740
http://dx.doi.org/10.1093/database/bax017
_version_ 1783243287212064768
author Jiang, Xiangying
Ringwald, Martin
Blake, Judith
Shatkay, Hagit
collection PubMed
description The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org
format Online
Article
Text
id pubmed-5467553
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-54675532017-06-19 Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD) Jiang, Xiangying Ringwald, Martin Blake, Judith Shatkay, Hagit Database (Oxford) Original Article The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. 
Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org Oxford University Press 2017-03-24 /pmc/articles/PMC5467553/ /pubmed/28365740 http://dx.doi.org/10.1093/database/bax017 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467553/
https://www.ncbi.nlm.nih.gov/pubmed/28365740
http://dx.doi.org/10.1093/database/bax017