A comprehensive software suite for protein family construction and functional site prediction

In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Backg...


Bibliographic Details
Main authors: Haft, David Renfrew, Haft, Daniel H.
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5300114/
https://www.ncbi.nlm.nih.gov/pubmed/28182651
http://dx.doi.org/10.1371/journal.pone.0171758
author Haft, David Renfrew
Haft, Daniel H.
collection PubMed
description In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user’s query sequence, while the NO partition is depleted. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the process for using SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions.
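The window-mining step described above (scan the query for short regions whose closest matches come mostly from the YES partition) can be illustrated with a minimal Python sketch. This is not the published SIMBAL implementation: the function names, the ungapped-identity similarity measure, the binomial-tail score, and the toy sequences are all assumptions chosen to keep the example self-contained.

```python
from math import comb, log10


def best_ungapped_identity(window: str, protein: str) -> int:
    """Highest number of identical residues between `window` and any
    equal-length slice of `protein` (a stand-in for a real similarity search)."""
    w = len(window)
    if len(protein) < w:
        return 0
    return max(
        sum(a == b for a, b in zip(window, protein[i:i + w]))
        for i in range(len(protein) - w + 1)
    )


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score_windows(query, yes_set, no_set, window=6, top_hits=3):
    """Score every window of `query` by how strongly its closest matches
    favor the YES partition. Returns (start, segment, yes_hits, score) tuples,
    where score = -log10 of the chance of seeing that many YES hits at random."""
    labeled = [(s, True) for s in yes_set] + [(s, False) for s in no_set]
    if not labeled:
        raise ValueError("training set is empty")
    background = len(yes_set) / len(labeled)  # expected YES fraction by chance
    results = []
    for start in range(len(query) - window + 1):
        segment = query[start:start + window]
        # Rank training proteins by similarity to this window (stable sort).
        ranked = sorted(
            labeled,
            key=lambda item: best_ungapped_identity(segment, item[0]),
            reverse=True,
        )
        hits = ranked[:top_hits]
        yes_hits = sum(is_yes for _, is_yes in hits)
        p = binomial_tail(yes_hits, len(hits), background)
        results.append((start, segment, yes_hits, -log10(p)))
    return results


if __name__ == "__main__":
    # Hypothetical toy data: YES proteins share the motif CDEF, NO proteins do not.
    yes = ["MKTACDEFGL", "MKSACDEFGV"]
    no = ["MKTAWWWWGL", "MKSAYYYYGV", "MKTAHHHHGL"]
    for row in score_windows("MKTACDEFGL", yes, no, window=4, top_hits=2):
        print(row)
```

In this toy data, windows overlapping the shared motif draw all of their top hits from the YES partition and so score higher than the window over the common N-terminal region, which is the behavior the description attributes to signature regions. A real run would replace the identity count with a proper sequence search and use far larger training sets.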
format Online
Article
Text
id pubmed-5300114
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5300114 2017-02-28 A comprehensive software suite for protein family construction and functional site prediction Haft, David Renfrew Haft, Daniel H. PLoS One Research Article Public Library of Science 2017-02-09 /pmc/articles/PMC5300114/ /pubmed/28182651 http://dx.doi.org/10.1371/journal.pone.0171758 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title A comprehensive software suite for protein family construction and functional site prediction
topic Research Article