Cargando…

Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-te...

Descripción completa

Detalles Bibliográficos
Autores principales: Blei, DM, Franks, K, Jordan, MI, Mian, IS
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2006
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533868/
https://www.ncbi.nlm.nih.gov/pubmed/16681860
http://dx.doi.org/10.1186/1471-2105-7-250
_version_ 1782129077209006080
author Blei, DM
Franks, K
Jordan, MI
Mian, IS
author_facet Blei, DM
Franks, K
Jordan, MI
Mian, IS
author_sort Blei, DM
collection PubMed
description BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. RESULTS: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. CONCLUSION: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
format Text
id pubmed-1533868
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-15338682006-08-08 Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span Blei, DM Franks, K Jordan, MI Mian, IS BMC Bioinformatics Research Article BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. RESULTS: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. CONCLUSION: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation. BioMed Central 2006-05-08 /pmc/articles/PMC1533868/ /pubmed/16681860 http://dx.doi.org/10.1186/1471-2105-7-250 Text en Copyright ©2006 Blei et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Blei, DM
Franks, K
Jordan, MI
Mian, IS
Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title_full Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title_fullStr Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title_full_unstemmed Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title_short Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
title_sort statistical modeling of biomedical corpora: mining the caenorhabditis genetic center bibliography for genes related to life span
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533868/
https://www.ncbi.nlm.nih.gov/pubmed/16681860
http://dx.doi.org/10.1186/1471-2105-7-250
work_keys_str_mv AT bleidm statisticalmodelingofbiomedicalcorporaminingthecaenorhabditisgeneticcenterbibliographyforgenesrelatedtolifespan
AT franksk statisticalmodelingofbiomedicalcorporaminingthecaenorhabditisgeneticcenterbibliographyforgenesrelatedtolifespan
AT jordanmi statisticalmodelingofbiomedicalcorporaminingthecaenorhabditisgeneticcenterbibliographyforgenesrelatedtolifespan
AT mianis statisticalmodelingofbiomedicalcorporaminingthecaenorhabditisgeneticcenterbibliographyforgenesrelatedtolifespan