Multi-label literature classification based on the Gene Ontology graph

BACKGROUND: The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges...

Bibliographic Details
Main Authors: Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua
Format: Text
Language: English
Published: BioMed Central 2008
Subjects: Methodology Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2644325/
https://www.ncbi.nlm.nih.gov/pubmed/19063730
http://dx.doi.org/10.1186/1471-2105-9-525
author Jin, Bo
Muller, Brian
Zhai, Chengxiang
Lu, Xinghua
collection PubMed
description BACKGROUND: The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. RESULTS: In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. 
CONCLUSION: Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
format Text
id pubmed-2644325
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2644325 2009-02-18 Multi-label literature classification based on the Gene Ontology graph Jin, Bo Muller, Brian Zhai, Chengxiang Lu, Xinghua BMC Bioinformatics Methodology Article BioMed Central 2008-12-08 /pmc/articles/PMC2644325/ /pubmed/19063730 http://dx.doi.org/10.1186/1471-2105-9-525 Text en Copyright © 2008 Jin et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Multi-label literature classification based on the Gene Ontology graph
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2644325/
https://www.ncbi.nlm.nih.gov/pubmed/19063730
http://dx.doi.org/10.1186/1471-2105-9-525