Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task....
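Main authors: | Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T |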
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2020 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/ https://www.ncbi.nlm.nih.gov/pubmed/32797179 http://dx.doi.org/10.1093/bioinformatics/btaa701 |
_version_ | 1783680409609961472 |
---|---|
author | Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T |
author_facet | Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T |
author_sort | Villegas-Morcillo, Amelia |
collection | PubMed |
description | MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8055213 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8055213 2021-04-28 Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T Bioinformatics Original Papers MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-08-14 /pmc/articles/PMC8055213/ /pubmed/32797179 http://dx.doi.org/10.1093/bioinformatics/btaa701 Text en © The Author(s) 2020. Published by Oxford University Press.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Papers Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title_full | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title_fullStr | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title_full_unstemmed | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title_short | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
title_sort | unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/ https://www.ncbi.nlm.nih.gov/pubmed/32797179 http://dx.doi.org/10.1093/bioinformatics/btaa701 |
work_keys_str_mv | AT villegasmorcilloamelia unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction AT makrodimitrisstavros unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction AT vanhamroelandchj unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction AT gomezangelm unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction AT sanchezvictoria unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction AT reindersmarceljt unsupervisedproteinembeddingsoutperformhandcraftedsequenceandstructurefeaturesatpredictingmolecularfunction |
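
The description above names one-hot encoding of amino acids and k-mer counts among the hand-crafted baseline features. As a rough illustration of what such features can look like, here is a minimal Python sketch. It is not taken from the authors' repository: the function names, the choice of k = 2, the all-zero handling of non-standard residues, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(sequence):
    """Return an L x 20 matrix (list of lists) with a single 1 per standard residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # assumption: non-standard residues stay all-zero
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Return a fixed-length vector (20**k entries) of overlapping k-mer counts."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate every possible k-mer in a fixed order so all proteins share one layout.
    return [counts["".join(kmer)] for kmer in product(AMINO_ACIDS, repeat=k)]


toy = "MKVLA"  # hypothetical toy sequence, not from the paper
encoded = one_hot(toy)
vector = kmer_counts(toy, k=2)
print(len(encoded), len(encoded[0]), len(vector), sum(vector))  # prints: 5 20 400 4
```

The one-hot matrix grows with sequence length, whereas the k-mer vector has the same size for every protein, which is why count-based features can be fed directly to a fixed-input classifier such as a perceptron.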