Family classification without domain chaining

Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify de...

Bibliographic Details
Main authors: Joseph, Jacob M., Durand, Dannie
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2687961/
https://www.ncbi.nlm.nih.gov/pubmed/19478015
http://dx.doi.org/10.1093/bioinformatics/btp207
author Joseph, Jacob M.
Durand, Dannie
collection PubMed
description Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms.
Results: Here, we investigate a network-rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with current methods. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data.
Contact: jacobmj@cmu.edu
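The domain-chaining problem described in the abstract can be illustrated with a small sketch. Everything in it is hypothetical: the sequence names, the domain architectures, and the choice of `SH3` as the promiscuous domain are toy assumptions. The edge filter is a deliberately naive stand-in, not the network-rewiring strategy of the paper, whose criterion is not given in this record. Connected components are used here as the simplest possible clustering step.

```python
from itertools import combinations

# Hypothetical toy data: sequence name -> set of domain labels.
ARCHITECTURES = {
    "kinA1": {"Kinase", "SH3"},
    "kinA2": {"Kinase", "SH3"},
    "adpB1": {"SH3", "PDZ"},
    "adpB2": {"SH3", "PDZ"},
}


def homology_edges(architectures):
    """Link two sequences whenever they share at least one domain."""
    return {
        (a, b): architectures[a] & architectures[b]
        for a, b in combinations(sorted(architectures), 2)
        if architectures[a] & architectures[b]
    }


def connected_components(nodes, edges):
    """Single-linkage clustering: connected components via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


def drop_promiscuous_edges(edges, promiscuous):
    """Naive filter (an assumption, not the paper's method): keep an edge
    only if at least one shared domain is not in the promiscuous set."""
    return {pair: shared for pair, shared in edges.items() if shared - promiscuous}


edges = homology_edges(ARCHITECTURES)
chained = connected_components(ARCHITECTURES, edges)
cleaned = connected_components(
    ARCHITECTURES, drop_promiscuous_edges(edges, {"SH3"})
)
print(chained)  # one cluster: the shared SH3 domain chains both families together
print(cleaned)  # two clusters: one per family
```

In this toy network the shared domain alone is enough to merge two unrelated families into a single component, which is the kind of false linkage the abstract attributes to promiscuous domains.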
format Text
id pubmed-2687961
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2687961 2009-06-02
Family classification without domain chaining
Joseph, Jacob M.
Durand, Dannie
Bioinformatics
Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden
Oxford University Press 2009-06-15 2009-05-27
/pmc/articles/PMC2687961/ /pubmed/19478015 http://dx.doi.org/10.1093/bioinformatics/btp207
Text en
© 2009 The Author(s)
http://creativecommons.org/licenses/by-nc/2.0/uk/
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Family classification without domain chaining
topic Ismb/Eccb 2009 Conference Proceedings June 27 to July 2, 2009, Stockholm, Sweden