Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space

BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-b...

Bibliographic Details
Main Authors: Molloy, Kevin, Van, M Jennifer, Barbara, Daniel, Shehu, Amarda
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120149/
https://www.ncbi.nlm.nih.gov/pubmed/25080993
http://dx.doi.org/10.1186/1471-2105-15-S8-S4
_version_ 1782329046533668864
author Molloy, Kevin
Van, M Jennifer
Barbara, Daniel
Shehu, Amarda
author_facet Molloy, Kevin
Van, M Jennifer
Barbara, Daniel
Shehu, Amarda
author_sort Molloy, Kevin
collection PubMed
description BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons can detect close homologs of a protein of interest, so that functional information can be transferred from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs, but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, and is further supported by the analysis in this paper, that such maps preserve functional co-localization of the protein structure space. METHODS: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in a few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. RESULTS: We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership. CONCLUSIONS: This work opens exciting avenues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools.
format Online
Article
Text
id pubmed-4120149
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4120149 2014-08-11 Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space Molloy, Kevin Van, M Jennifer Barbara, Daniel Shehu, Amarda BMC Bioinformatics Research BACKGROUND: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons can detect close homologs of a protein of interest, so that functional information can be transferred from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs, but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, and is further supported by the analysis in this paper, that such maps preserve functional co-localization of the protein structure space. METHODS: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in a few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. RESULTS: We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership. CONCLUSIONS: This work opens exciting avenues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools. BioMed Central 2014-07-14 /pmc/articles/PMC4120149/ /pubmed/25080993 http://dx.doi.org/10.1186/1471-2105-15-S8-S4 Text en Copyright © 2014 Molloy et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Molloy, Kevin
Van, M Jennifer
Barbara, Daniel
Shehu, Amarda
Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title_full Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title_fullStr Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title_full_unstemmed Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title_short Exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
title_sort exploring representations of protein structure for automated remote homology detection and mapping of protein structure space
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120149/
https://www.ncbi.nlm.nih.gov/pubmed/25080993
http://dx.doi.org/10.1186/1471-2105-15-S8-S4
work_keys_str_mv AT molloykevin exploringrepresentationsofproteinstructureforautomatedremotehomologydetectionandmappingofproteinstructurespace
AT vanmjennifer exploringrepresentationsofproteinstructureforautomatedremotehomologydetectionandmappingofproteinstructurespace
AT barbaradaniel exploringrepresentationsofproteinstructureforautomatedremotehomologydetectionandmappingofproteinstructurespace
AT shehuamarda exploringrepresentationsofproteinstructureforautomatedremotehomologydetectionandmappingofproteinstructurespace