Identifying biological concepts from a protein-related corpus with a probabilistic topic model

BACKGROUND: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concept...


Bibliographic Details
Main Authors: Zheng, Bin, McLean, David C, Lu, Xinghua
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1420333/
https://www.ncbi.nlm.nih.gov/pubmed/16466569
http://dx.doi.org/10.1186/1471-2105-7-58
_version_ 1782127150089895936
author Zheng, Bin
McLean, David C
Lu, Xinghua
author_facet Zheng, Bin
McLean, David C
Lu, Xinghua
author_sort Zheng, Bin
collection PubMed
description BACKGROUND: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE© titles and abstracts by applying a probabilistic topic model. RESULTS: The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on the Bayesian model selection, 300 major topics were extracted from the corpus. The majority of identified topics/concepts was found to be semantically coherent and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of the Gene Ontology (GO) terms based on mutual information. CONCLUSION: The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide parsimonious and semantically-enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text.
format Text
id pubmed-1420333
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1420333 2006-04-21 Identifying biological concepts from a protein-related corpus with a probabilistic topic model Zheng, Bin McLean, David C Lu, Xinghua BMC Bioinformatics Research Article BACKGROUND: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE© titles and abstracts by applying a probabilistic topic model. RESULTS: The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on the Bayesian model selection, 300 major topics were extracted from the corpus. The majority of identified topics/concepts was found to be semantically coherent and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of the Gene Ontology (GO) terms based on mutual information. CONCLUSION: The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide parsimonious and semantically-enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text. BioMed Central 2006-02-08 /pmc/articles/PMC1420333/ /pubmed/16466569 http://dx.doi.org/10.1186/1471-2105-7-58 Text en Copyright © 2006 Zheng et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identifying biological concepts from a protein-related corpus with a probabilistic topic model
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1420333/
https://www.ncbi.nlm.nih.gov/pubmed/16466569
http://dx.doi.org/10.1186/1471-2105-7-58