
GONOME: measuring correlations between GO terms and genomic positions

BACKGROUND: Current methods to find significantly under- and over-represented gene ontology (GO) terms in a set of genes consider the genes as equally probable "balls in a bag", as may be appropriate for transcripts in micro-array data. However, due to the varying length of genes and inter...


Bibliographic Details
Main authors: Stanley, Stefan M; Bailey, Timothy L; Mattick, John S
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1413564/
https://www.ncbi.nlm.nih.gov/pubmed/16504139
http://dx.doi.org/10.1186/1471-2105-7-94
author Stanley, Stefan M
Bailey, Timothy L
Mattick, John S
collection PubMed
description BACKGROUND: Current methods to find significantly under- and over-represented gene ontology (GO) terms in a set of genes consider the genes as equally probable "balls in a bag", as may be appropriate for transcripts in micro-array data. However, due to the varying length of genes and intergenic regions, that approach is inappropriate for deciding if any GO terms are correlated with a set of genomic positions. RESULTS: We present an algorithm – GONOME – that can determine which GO terms are significantly associated with a set of genomic positions given a genome annotated with (at least) the starts and ends of genes. We show that certain GO terms may appear to be significantly associated with a set of randomly chosen positions in the human genome if gene lengths are not considered, and that these same terms have been reported as significantly over-represented in a number of recent papers. This apparent over-representation disappears when gene lengths are considered, as GONOME does. For example, we show that, when gene length is taken into account, the term "development" is not significantly enriched in genes associated with human CpG islands, in contradiction to a previous report. We further demonstrate the efficacy of GONOME by showing that occurrences of the proteasome-associated control element (PACE) upstream activating sequence in the S. cerevisiae genome associate significantly with appropriate GO terms. An extension of this approach yields a whole-genome motif discovery algorithm that allows identification of many other promoter sequences linked to different types of genes, including a large group of previously unknown motifs significantly associated with the terms 'translation' and 'translational elongation'. CONCLUSION: GONOME is an algorithm that correctly extracts over-represented GO terms from a set of genomic positions. By explicitly considering gene size, GONOME avoids a systematic bias toward GO terms linked to large genes. Inappropriate use of existing algorithms that do not take gene size into account has led to erroneous or suspect conclusions. Reciprocally, GONOME may be used to identify new features in genomes that are significantly associated with particular categories of genes.
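The gene-length bias described in the abstract can be illustrated with a small sketch. This is not the GONOME algorithm itself, whose details are not given in this record; it is a plain binomial test, swapped in for illustration, that compares a "balls in a bag" background (every gene equally likely) with a length-weighted background (a gene is hit in proportion to its length). The function names, the toy gene table, and the positions are all hypothetical.

```python
from math import comb


def binomial_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def term_enrichment(genes, positions, term):
    """Compare count-based and length-based backgrounds for one GO term.

    genes: list of (start, end, set_of_terms), inclusive coordinates,
           assumed non-overlapping and on a single sequence (assumption).
    positions: list of integer genomic positions.
    """
    def terms_at(pos):
        for start, end, terms in genes:
            if start <= pos <= end:
                return terms
        return None  # intergenic position, ignored in this sketch

    hits = [t for t in (terms_at(p) for p in positions) if t is not None]
    n = len(hits)                             # positions that fall in a gene
    k = sum(1 for t in hits if term in t)     # ...in a gene carrying the term

    total_len = sum(e - s + 1 for s, e, _ in genes)
    term_len = sum(e - s + 1 for s, e, t in genes if term in t)
    term_count = sum(1 for _, _, t in genes if term in t)

    p_count = term_count / len(genes)   # "balls in a bag" background
    p_length = term_len / total_len     # length-weighted background
    return {
        "n": n,
        "k": k,
        "p_count": p_count,
        "p_length": p_length,
        "pval_count": binomial_sf(k, n, p_count),
        "pval_length": binomial_sf(k, n, p_length),
    }


# Toy genome (hypothetical): one long "development" gene, nine short others.
genes = [(1, 9000, {"development"})]
genes += [(10001 + 1000 * i, 10100 + 1000 * i, {"metabolism"}) for i in range(9)]

# Nine positions in the long gene, one in a short gene, one intergenic.
positions = list(range(500, 9000, 1000)) + [10050, 9500]

result = term_enrichment(genes, positions, "development")
print(result)
```

In this toy case the count-based background makes "development" look strongly over-represented, while the length-weighted background does not, simply because the one long gene covers most of the genic sequence.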
format Text
id pubmed-1413564
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1413564 2006-04-21 GONOME: measuring correlations between GO terms and genomic positions Stanley, Stefan M Bailey, Timothy L Mattick, John S BMC Bioinformatics Methodology Article BioMed Central 2006-02-25 /pmc/articles/PMC1413564/ /pubmed/16504139 http://dx.doi.org/10.1186/1471-2105-7-94 Text en Copyright © 2006 Stanley et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GONOME: measuring correlations between GO terms and genomic positions
topic Methodology Article