GeneTopics - interpretation of gene sets via literature-driven topic models
BACKGROUND: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevan...
Main authors: | Wang, Vicky; Xi, Li; Enayetallah, Ahmed; Fauman, Eric; Ziemek, Daniel |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029197/ https://www.ncbi.nlm.nih.gov/pubmed/24564875 http://dx.doi.org/10.1186/1752-0509-7-S5-S10 |
_version_ | 1782317170954338304 |
---|---|
author | Wang, Vicky Xi, Li Enayetallah, Ahmed Fauman, Eric Ziemek, Daniel |
author_facet | Wang, Vicky Xi, Li Enayetallah, Ahmed Fauman, Eric Ziemek, Daniel |
author_sort | Wang, Vicky |
collection | PubMed |
description | BACKGROUND: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. METHODS: Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. RESULTS: We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. CONCLUSIONS: Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets. |
format | Online Article Text |
id | pubmed-4029197 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-40291972014-06-19 GeneTopics - interpretation of gene sets via literature-driven topic models Wang, Vicky Xi, Li Enayetallah, Ahmed Fauman, Eric Ziemek, Daniel BMC Syst Biol Research BACKGROUND: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. METHODS: Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. RESULTS: We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. CONCLUSIONS: Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets. BioMed Central 2013-12-09 /pmc/articles/PMC4029197/ /pubmed/24564875 http://dx.doi.org/10.1186/1752-0509-7-S5-S10 Text en Copyright © 2013 Wang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Wang, Vicky Xi, Li Enayetallah, Ahmed Fauman, Eric Ziemek, Daniel GeneTopics - interpretation of gene sets via literature-driven topic models |
title | GeneTopics - interpretation of gene sets via literature-driven topic models |
title_full | GeneTopics - interpretation of gene sets via literature-driven topic models |
title_fullStr | GeneTopics - interpretation of gene sets via literature-driven topic models |
title_full_unstemmed | GeneTopics - interpretation of gene sets via literature-driven topic models |
title_short | GeneTopics - interpretation of gene sets via literature-driven topic models |
title_sort | genetopics - interpretation of gene sets via literature-driven topic models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029197/ https://www.ncbi.nlm.nih.gov/pubmed/24564875 http://dx.doi.org/10.1186/1752-0509-7-S5-S10 |