Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-b...
Main authors: | Molloy, Kevin; Van, M Jennifer; Barbara, Daniel; Shehu, Amarda |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120149/ https://www.ncbi.nlm.nih.gov/pubmed/25080993 http://dx.doi.org/10.1186/1471-2105-15-S8-S4 |
_version_ | 1782329046533668864 |
---|---|
author | Molloy, Kevin Van, M Jennifer Barbara, Daniel Shehu, Amarda |
author_facet | Molloy, Kevin Van, M Jennifer Barbara, Daniel Shehu, Amarda |
author_sort | Molloy, Kevin |
collection | PubMed |
description | BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, as additionally supported from analysis in this paper that such maps preserve functional co-localization of the protein structure space. METHODS: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. RESULTS: We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership. CONCLUSIONS: This work opens exciting venues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools. |
format | Online Article Text |
id | pubmed-4120149 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4120149 2014-08-11 Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space Molloy, Kevin Van, M Jennifer Barbara, Daniel Shehu, Amarda BMC Bioinformatics Research BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, as additionally supported from analysis in this paper that such maps preserve functional co-localization of the protein structure space. METHODS: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. RESULTS: We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership. CONCLUSIONS: This work opens exciting venues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools. BioMed Central 2014-07-14 /pmc/articles/PMC4120149/ /pubmed/25080993 http://dx.doi.org/10.1186/1471-2105-15-S8-S4 Text en Copyright © 2014 Molloy et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Molloy, Kevin Van, M Jennifer Barbara, Daniel Shehu, Amarda Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space |
title | Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120149/ https://www.ncbi.nlm.nih.gov/pubmed/25080993 http://dx.doi.org/10.1186/1471-2105-15-S8-S4 |