How to decide which are the most pertinent overly-represented features during gene set enrichment analysis

BACKGROUND: The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations amongst heterogeneous data such as Gene Ontology annotations, gene expression data and genome location of genes. Desp...


Bibliographic Details
Main Authors:	Barriot, Roland; Sherman, David J; Dutour, Isabelle
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2206060/ https://www.ncbi.nlm.nih.gov/pubmed/17848190 http://dx.doi.org/10.1186/1471-2105-8-332

_version_	1782148440584617984
author	Barriot, Roland; Sherman, David J; Dutour, Isabelle
collection	PubMed
description	BACKGROUND: The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations among heterogeneous data such as Gene Ontology annotations, gene expression data and the genome location of genes. Despite the rapid growth of available data, very little has been proposed in terms of formalization and optimization. Additionally, current methods mostly ignore the structure of the data, which causes redundancy in the results. For example, when searching for enrichment in GO terms, genes can be annotated with multiple GO terms, and these annotations should be propagated to the more general terms in the Gene Ontology. Consequently, the gene sets often overlap partially or totally, which makes the reported enriched GO terms both numerous and redundant, overwhelming the researcher with non-pertinent information. This situation is not unique: it arises whenever some hierarchical clustering is performed (e.g. based on gene expression profiles), the extreme case being when genes that are neighbors on the chromosomes are considered.
RESULTS: We present a generic framework to efficiently identify the most pertinent over-represented features in a set of genes. We propose a formal representation of gene sets based on the theory of partially ordered sets (posets), and give a formal definition of target set pertinence. Algorithms and compact representations of target sets are provided for the generation and the evaluation of the pertinent target sets. The relevance of our method is illustrated through the search for enriched GO annotations in the proteins involved in a multiprotein complex. The results obtained demonstrate the gain in terms of pertinence (up to 64% redundancy removed), space requirements (up to 73% less storage) and efficiency (up to 98% fewer comparisons).
CONCLUSION: The generic framework presented in this article provides a formal approach to adequately represent available data and efficiently search for pertinent over-represented features in a set of genes or proteins. The formalism and the pertinence definition can be directly used by most of the methods and tools currently available for feature enrichment analysis.
format	Text
id	pubmed-2206060
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2206060 2008-01-18 BMC Bioinformatics Methodology Article BioMed Central 2007-09-11 /pmc/articles/PMC2206060/ /pubmed/17848190 http://dx.doi.org/10.1186/1471-2105-8-332 Text en Copyright © 2007 Barriot et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	How to decide which are the most pertinent overly-represented features during gene set enrichment analysis
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2206060/ https://www.ncbi.nlm.nih.gov/pubmed/17848190 http://dx.doi.org/10.1186/1471-2105-8-332
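
The redundancy described in the abstract can be illustrated with a small sketch. It is not the authors' poset framework or their pertinence definition, neither of which is specified in this record. It only shows how propagating annotations to more general terms produces nested gene sets that are then all reported. The ontology, gene names, and the choice of a one-sided hypergeometric test are assumptions made for illustration.

```python
from math import comb

# Hypothetical toy ontology (not real GO terms): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "lipid metabolism": {"metabolism"},
    "transport": {"root"},
}

# Hypothetical direct annotations: gene -> most specific terms.
DIRECT = {
    "g1": {"lipid metabolism"},
    "g2": {"lipid metabolism"},
    "g3": {"metabolism"},
    "g4": {"transport"},
    "g5": {"transport"},
    "g6": {"transport"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct):
    """Map each term to the genes annotated with it or with any descendant."""
    term_genes = {term: set() for term in PARENTS}
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term):
                term_genes[t].add(gene)
    return term_genes


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / comb(N, n)


TERM_GENES = propagate(DIRECT)
N_GENES = len(DIRECT)


def enrichment(query):
    """Score every term that shares at least one gene with the query set."""
    return {
        term: hypergeom_pvalue(len(query & genes), len(query), len(genes), N_GENES)
        for term, genes in TERM_GENES.items()
        if query & genes
    }


RESULT = enrichment({"g1", "g2"})
for term, p in sorted(RESULT.items(), key=lambda item: item[1]):
    print(f"{term}: p = {p:.3f}")
```

For the query set {g1, g2}, the sketch scores "lipid metabolism", "metabolism" and "root", three nested sets describing the same two genes. Reducing such a list to its most informative entries is the kind of problem the abstract's pertinence definition addresses.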
