
Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins

We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accura...

Bibliographic Details
Main authors:	Song, Nan, Joseph, Jacob M., Davis, George B., Durand, Dannie
Format:	Text
Language:	English
Published:	Public Library of Science 2008
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2377100/ https://www.ncbi.nlm.nih.gov/pubmed/18475320 http://dx.doi.org/10.1371/journal.pcbi.1000063

_version_	1782154777442910208
author	Song, Nan Joseph, Jacob M. Davis, George B. Durand, Dannie
author_facet	Song, Nan Joseph, Jacob M. Davis, George B. Durand, Dannie
author_sort	Song, Nan
collection	PubMed
description	We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
format	Text
id	pubmed-2377100
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-2377100 2008-05-16 Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins Song, Nan Joseph, Jacob M. Davis, George B. Durand, Dannie PLoS Comput Biol Research Article Public Library of Science 2008-05-16 /pmc/articles/PMC2377100/ /pubmed/18475320 http://dx.doi.org/10.1371/journal.pcbi.1000063 Text en Song et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Song, Nan Joseph, Jacob M. Davis, George B. Durand, Dannie Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title	Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title_full	Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title_fullStr	Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title_full_unstemmed	Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title_short	Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins
title_sort	sequence similarity network reveals common ancestry of multidomain proteins
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2377100/ https://www.ncbi.nlm.nih.gov/pubmed/18475320 http://dx.doi.org/10.1371/journal.pcbi.1000063
work_keys_str_mv	AT songnan sequencesimilaritynetworkrevealscommonancestryofmultidomainproteins AT josephjacobm sequencesimilaritynetworkrevealscommonancestryofmultidomainproteins AT davisgeorgeb sequencesimilaritynetworkrevealscommonancestryofmultidomainproteins AT duranddannie sequencesimilaritynetworkrevealscommonancestryofmultidomainproteins

Similar items