
SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification

With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the...


Bibliographic Details
Main Authors: Fan, Ming, Wong, Ka-Chun, Ryu, Taewoo, Ravasi, Timothy, Gao, Xin
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3386278/
https://www.ncbi.nlm.nih.gov/pubmed/22761802
http://dx.doi.org/10.1371/journal.pone.0039475
_version_ 1782236963091251200
author Fan, Ming
Wong, Ka-Chun
Ryu, Taewoo
Ravasi, Timothy
Gao, Xin
author_facet Fan, Ming
Wong, Ka-Chun
Ryu, Taewoo
Ravasi, Timothy
Gao, Xin
author_sort Fan, Ming
collection PubMed
description With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
format Online
Article
Text
id pubmed-3386278
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3386278 2012-07-03 SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification Fan, Ming Wong, Ka-Chun Ryu, Taewoo Ravasi, Timothy Gao, Xin PLoS One Research Article Public Library of Science 2012-06-28 /pmc/articles/PMC3386278/ /pubmed/22761802 http://dx.doi.org/10.1371/journal.pone.0039475 Text en Fan et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Fan, Ming
Wong, Ka-Chun
Ryu, Taewoo
Ravasi, Timothy
Gao, Xin
SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification
title SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3386278/
https://www.ncbi.nlm.nih.gov/pubmed/22761802
http://dx.doi.org/10.1371/journal.pone.0039475
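
The indexing and graph-construction steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SECOM implementation: the record does not specify the hash seed function, so contiguous length-k substrings stand in for seeds, and the function names, the parameter k, and the toy sequences are all hypothetical. The overlapping community-finding step (the backward graph percolation algorithm) is not shown because the record does not describe it.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=3):
    """Map each length-k substring to the set of proteins containing it.

    Assumption: contiguous k-mers are a stand-in for SECOM's hash seeds.
    """
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def similarity_graph(index):
    """Build a weighted graph: nodes are proteins, and each edge weight
    is the number of distinct seeds shared by the two proteins."""
    weights = defaultdict(int)
    for names in index.values():
        # Every pair of proteins sharing this seed gains one unit of weight.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy sequences (hypothetical): P1 and P2 share the segment "TAYIAK".
proteins = {
    "P1": "MKTAYIAKQR",
    "P2": "GGTAYIAKLL",
    "P3": "WWWWHHHHCC",
}
graph = similarity_graph(seed_index(proteins, k=3))
print(graph)
```

Because proteins are compared only through seeds they have in common, pairs with no shared seed never appear in the graph, which is how an index of this kind avoids all-against-all alignment.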