GeneTopics - interpretation of gene sets via literature-driven topic models

Bibliographic Details
Main Authors: Wang, Vicky, Xi, Li, Enayetallah, Ahmed, Fauman, Eric, Ziemek, Daniel
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029197/
https://www.ncbi.nlm.nih.gov/pubmed/24564875
http://dx.doi.org/10.1186/1752-0509-7-S5-S10
author Wang, Vicky
Xi, Li
Enayetallah, Ahmed
Fauman, Eric
Ziemek, Daniel
collection PubMed
description
BACKGROUND: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, do not capture relevant biological themes, or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set.
METHODS: Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic, allowing us to eliminate non-specific topics. As a result, we obtain a set of literature topics in which each topic is associated with a subset of the input genes, providing directly interpretable keywords and corresponding documents for literature research.
RESULTS: We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets: (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner.
CONCLUSIONS: Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets.
format Online
Article
Text
id pubmed-4029197
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4029197 2014-06-19 GeneTopics - interpretation of gene sets via literature-driven topic models Wang, Vicky Xi, Li Enayetallah, Ahmed Fauman, Eric Ziemek, Daniel BMC Syst Biol Research BioMed Central 2013-12-09 /pmc/articles/PMC4029197/ /pubmed/24564875 http://dx.doi.org/10.1186/1752-0509-7-S5-S10 Text en Copyright © 2013 Wang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title GeneTopics - interpretation of gene sets via literature-driven topic models
topic Research