An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-da...

Bibliographic Details
Main Authors: Jiang, Xiangying; Ringwald, Martin; Blake, Judith A; Arighi, Cecilia; Zhang, Gongbo; Shatkay, Hagit
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6482935/
https://www.ncbi.nlm.nih.gov/pubmed/31032839
http://dx.doi.org/10.1093/database/baz045
author Jiang, Xiangying
Ringwald, Martin
Blake, Judith A
Arighi, Cecilia
Zhang, Gongbo
Shatkay, Hagit
collection PubMed
description Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory’s Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
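The abstract names cluster-based under-sampling but does not spell out the procedure. The sketch below shows one common reading of that idea, not the authors' implementation: the majority (not relevant) documents are split into k clusters, and each cluster is paired with all minority (relevant) documents to form a less imbalanced training subset. A base classifier could then be trained on each subset and the outputs combined by a meta-classifier. The names `kmeans` and `cluster_undersample`, the plain Lloyd's k-means, and the dense list-of-floats feature vectors are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Labeled = Tuple[Vector, int]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points: List[Vector], k: int, iters: int = 20, seed: int = 0) -> List[List[Vector]]:
    """Plain Lloyd's k-means (an assumed choice); returns the non-empty clusters."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters: List[List[Vector]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid when a cluster receives no points.
        centroids = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def cluster_undersample(
    majority: List[Vector], minority: List[Vector], k: int, seed: int = 0
) -> List[List[Labeled]]:
    """Build one training subset per majority-class cluster.

    Each subset holds one cluster of majority documents (label 0)
    plus every minority document (label 1).
    """
    subsets: List[List[Labeled]] = []
    for cluster in kmeans(majority, k, seed=seed):
        subsets.append([(p, 0) for p in cluster] + [(p, 1) for p in minority])
    return subsets
```

With roughly 13 000 relevant abstracts out of more than 90 000, a k of about 6 would make each subset close to balanced on average, although actual cluster sizes vary; that value is a back-of-the-envelope assumption, not one reported in the record.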
format Online
Article
Text
id pubmed-6482935
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6482935 2019-04-29 Database (Oxford) Original Article
Oxford University Press 2019-04-25 /pmc/articles/PMC6482935/ /pubmed/31032839 http://dx.doi.org/10.1093/database/baz045 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title An effective biomedical document classification scheme in support of biocuration: addressing class imbalance
topic Original Article