Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task....

Bibliographic Details
Main Authors: Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/
https://www.ncbi.nlm.nih.gov/pubmed/32797179
http://dx.doi.org/10.1093/bioinformatics/btaa701
_version_ 1783680409609961472
author Villegas-Morcillo, Amelia
Makrodimitris, Stavros
van Ham, Roeland C H J
Gomez, Angel M
Sanchez, Victoria
Reinders, Marcel J T
collection PubMed
description MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
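For readers unfamiliar with the hand-crafted baselines named in the description, the following is a minimal sketch of two of them: one-hot encoding of amino acids and k-mer counts. It uses the common textbook definitions and is not taken from the authors' repository; the function names `one_hot` and `kmer_counts`, the 20-letter alphabet ordering, and the choice to encode non-standard residues as all-zero rows are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one 20-dimensional indicator vector per residue."""
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Assumption: residues outside the alphabet stay all-zero.
        if residue in AA_INDEX:
            row[AA_INDEX[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers as a fixed-length vector of size 20**k."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


if __name__ == "__main__":
    seq = "MKVLAAGMKV"  # toy sequence, not from the paper
    encoded = one_hot(seq)
    print(len(encoded), len(encoded[0]))  # residues x alphabet size
    vec = kmer_counts(seq, k=2)
    print(len(vec), sum(vec))  # vocabulary size, number of 2-mers
```

The one-hot matrix grows with sequence length, whereas the k-mer vector has a fixed length regardless of the protein, which is one reason k-mer counts are convenient as input to a simple classifier.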
format Online
Article
Text
id pubmed-8055213
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-80552132021-04-28 Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function Villegas-Morcillo, Amelia Makrodimitris, Stavros van Ham, Roeland C H J Gomez, Angel M Sanchez, Victoria Reinders, Marcel J T Bioinformatics Original Papers MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-08-14 /pmc/articles/PMC8055213/ /pubmed/32797179 http://dx.doi.org/10.1093/bioinformatics/btaa701 Text en © The Author(s) 2020. Published by Oxford University Press. 
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/
https://www.ncbi.nlm.nih.gov/pubmed/32797179
http://dx.doi.org/10.1093/bioinformatics/btaa701