
Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods—i.e., measures of similarity between query and target sequences—provide the engine for sequence database search...


Bibliographic Details
Main Authors: Melvin, Iain; Weston, Jason; Noble, William Stafford; Leslie, Christina
Format: Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3029239/
https://www.ncbi.nlm.nih.gov/pubmed/21298082
http://dx.doi.org/10.1371/journal.pcbi.1001047
author Melvin, Iain
Weston, Jason
Noble, William Stafford
Leslie, Christina
collection PubMed
description Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves higher accuracy than widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
format Text
id pubmed-3029239
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3029239 2011-02-04 Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding Melvin, Iain Weston, Jason Noble, William Stafford Leslie, Christina PLoS Comput Biol Research Article Public Library of Science 2011-01-27 /pmc/articles/PMC3029239/ /pubmed/21298082 http://dx.doi.org/10.1371/journal.pcbi.1001047 Text en Melvin et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3029239/
https://www.ncbi.nlm.nih.gov/pubmed/21298082
http://dx.doi.org/10.1371/journal.pcbi.1001047