
Gene Cluster Statistics with Gene Families

Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such “ge...


Bibliographic Details
Main authors:	Raghupathy, Narayanan; Durand, Dannie
Format:	Text
Language:	English
Published:	Oxford University Press 2009
Subjects:	Research Articles
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2668827/ https://www.ncbi.nlm.nih.gov/pubmed/19150803 http://dx.doi.org/10.1093/molbev/msp002

_version_	1782166208210010112
author	Raghupathy, Narayanan Durand, Dannie
author_facet	Raghupathy, Narayanan Durand, Dannie
author_sort	Raghupathy, Narayanan
collection	PubMed
description	Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such “gene clusters” is an essential component of comparative genomic analyses. However, currently there are no practical statistical tests for gene clusters that model the influence of the number of homologs in each gene family on cluster significance. In this work, we demonstrate empirically that failure to incorporate gene family size in gene cluster statistics results in overestimation of significance, leading to incorrect conclusions. We further present novel analytical methods for estimating gene cluster significance that take gene family size into account. Our methods do not require complete genome data and are suitable for testing individual clusters found in local regions, such as contigs in an unfinished assembly. We consider pairs of regions drawn from the same genome (paralogous clusters), as well as regions drawn from two different genomes (orthologous clusters). Determining cluster significance under general models of gene family size is computationally intractable. By assuming that all gene families are of equal size, we obtain analytical expressions that allow fast approximation of cluster probabilities. We evaluate the accuracy of this approximation by comparing the resulting gene cluster probabilities with cluster probabilities obtained by simulating a realistic, power-law distributed model of gene family size, with parameters inferred from genomic data. Surprisingly, despite the simplicity of the underlying assumption, our method accurately approximates the true cluster probabilities. It slightly overestimates these probabilities, yielding a conservative test. 
We present additional simulation results indicating the best choice of parameter values for data analysis in genomes of various sizes and illustrate the utility of our methods by applying them to gene clusters recently reported in the literature. Mathematica code to compute cluster probabilities using our methods is available as supplementary material.
format	Text
id	pubmed-2668827
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2668827 2009-04-20 Gene Cluster Statistics with Gene Families Raghupathy, Narayanan Durand, Dannie Mol Biol Evol Research Articles Oxford University Press 2009-05 2009-01-15 /pmc/articles/PMC2668827/ /pubmed/19150803 http://dx.doi.org/10.1093/molbev/msp002 Text en © 2009 The Authors This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Articles Raghupathy, Narayanan Durand, Dannie Gene Cluster Statistics with Gene Families
title	Gene Cluster Statistics with Gene Families
title_full	Gene Cluster Statistics with Gene Families
title_fullStr	Gene Cluster Statistics with Gene Families
title_full_unstemmed	Gene Cluster Statistics with Gene Families
title_short	Gene Cluster Statistics with Gene Families
title_sort	gene cluster statistics with gene families
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2668827/ https://www.ncbi.nlm.nih.gov/pubmed/19150803 http://dx.doi.org/10.1093/molbev/msp002
work_keys_str_mv	AT raghupathynarayanan geneclusterstatisticswithgenefamilies AT duranddannie geneclusterstatisticswithgenefamilies
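
The abstract's central claim, that ignoring gene family size overestimates cluster significance, can be illustrated with a small Monte Carlo sketch. This is not the authors' analytical method or their Mathematica code. It assumes a simplified null model in which two windows of r genes are drawn at random from two genomes of n_genes genes each, every gene family has the same size phi in both genomes, and the statistic is the number of families with a member in both windows. The function names (hypergeom_tail, simulate_tail) and all parameter values are hypothetical.

```python
import random
from math import comb


def hypergeom_tail(n_genes, r, k):
    """Exact P(shared >= k) under the assumed null model when phi = 1.

    With one gene per family, two random windows of r genes share a
    hypergeometrically distributed number of families.
    """
    total = comb(n_genes, r)
    favourable = sum(
        comb(r, j) * comb(n_genes - r, r - j) for j in range(k, r + 1)
    )
    return favourable / total


def simulate_tail(n_genes, phi, r, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(shared families >= k) for family size phi.

    Assumption: random gene order, so a window is a random sample of r
    gene positions; all families have exactly phi members per genome.
    """
    if n_genes % phi != 0:
        raise ValueError("n_genes must be a multiple of phi")
    rng = random.Random(seed)
    # Family label of each gene position: phi consecutive labels per family.
    genome = [g // phi for g in range(n_genes)]
    hits = 0
    for _ in range(trials):
        window1 = set(rng.sample(genome, r))  # families seen in window 1
        window2 = set(rng.sample(genome, r))  # families seen in window 2
        if len(window1 & window2) >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n_genes, r, k = 800, 20, 3
    print("phi=1 exact    :", hypergeom_tail(n_genes, r, k))
    print("phi=1 simulated:", simulate_tail(n_genes, 1, r, k))
    print("phi=4 simulated:", simulate_tail(n_genes, 4, r, k))
```

Under these assumptions, the chance of seeing at least k shared families by chance is larger when phi is greater than 1, so a test that treats every family as a single gene reports a cluster as more significant than it is.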

