
Predicting enzymatic function of protein sequences with attention

MOTIVATION: There is a growing number of available protein sequences, but only a small fraction has been manually annotated. For example, only 0.25% of all UniProtKB entries are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.

RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we show that a simple combination of attention maps is on par with, or better than, classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a maximum F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best.

AVAILABILITY AND IMPLEMENTATION: Source code is available at https://gitlab.inria.fr/nbuton/tfpc and datasets at https://doi.org/10.5281/zenodo.7253910
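As an illustration of how the level-based evaluation above works (a sketch under assumptions, not the paper's own evaluation code): EC numbers are hierarchical four-field codes such as 1.1.1.1, so predictions can be scored after truncating both reference and predicted labels to a given level. The Python sketch below assumes scikit-learn is available and uses hypothetical toy labels.

```python
# Illustrative sketch (assumed setup, not the paper's evaluation code):
# macro-F1 on EC number predictions truncated to a chosen level of the
# four-level EC hierarchy.
from sklearn.metrics import f1_score

def truncate_ec(ec: str, level: int) -> str:
    """Keep the first `level` fields of an EC number:
    truncate_ec("1.1.1.1", 2) -> "1.1"."""
    return ".".join(ec.split(".")[:level])

def macro_f1_at_level(y_true, y_pred, level=4):
    """Macro-F1 after truncating reference and predicted EC numbers."""
    t = [truncate_ec(ec, level) for ec in y_true]
    p = [truncate_ec(ec, level) for ec in y_pred]
    return f1_score(t, p, average="macro", zero_division=0)

# Hypothetical toy labels, purely for illustration:
y_true = ["1.1.1.1", "2.7.11.1", "3.4.21.4"]
y_pred = ["1.1.1.2", "2.7.11.1", "3.4.21.4"]
print(macro_f1_at_level(y_true, y_pred, level=2))  # 1.0: all correct at level 2
print(macro_f1_at_level(y_true, y_pred, level=4))  # lower: one miss at level 4
```

At level two only the first two fields must match, which is why the toy example scores perfectly there but drops at level four, where the full four-field code must be exact.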


Bibliographic Details
Main Authors: Buton, Nicolas, Coste, François, Le Cunff, Yann
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/
https://www.ncbi.nlm.nih.gov/pubmed/37874958
http://dx.doi.org/10.1093/bioinformatics/btad620
_version_ 1785128696784355328
author Buton, Nicolas
Coste, François
Le Cunff, Yann
author_facet Buton, Nicolas
Coste, François
Le Cunff, Yann
author_sort Buton, Nicolas
collection PubMed
description MOTIVATION: There is a growing number of available protein sequences, but only a small fraction has been manually annotated. For example, only 0.25% of all UniProtKB entries are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we show that a simple combination of attention maps is on par with, or better than, classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a maximum F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://gitlab.inria.fr/nbuton/tfpc and datasets at https://doi.org/10.5281/zenodo.7253910
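The abstract's interpretability result rests on combining attention maps into per-residue importance scores. The sketch below shows one simple way such a combination could look: averaging the attention each position receives across all layers and heads of a protein language model. This is an assumed illustration, not EnzBert's actual method or code; it uses the public Rostlab/prot_bert checkpoint from the HuggingFace hub as an example backbone, and the toy sequence is hypothetical.

```python
# Illustrative sketch only (not EnzBert's code): per-residue importance
# from attention maps, averaged over all layers and heads. Assumes the
# HuggingFace `transformers` and `torch` packages; Rostlab/prot_bert is
# a public protein language model used here as an example backbone.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
model = AutoModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)

sequence = "M K T A Y I A K Q R"  # ProtBert tokenizers expect space-separated residues
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer,
# each of shape (batch, heads, seq_len, seq_len).
att = torch.stack(out.attentions)        # (layers, 1, heads, L, L)
att = att.mean(dim=(0, 2)).squeeze(0)    # average layers+heads -> (L, L)
importance = att.mean(dim=0)             # attention received per position
print(importance)                        # includes special [CLS]/[SEP] tokens
```

Positions with high averaged attention can then be compared against annotated catalytic sites, which is in the spirit of the F-Gain evaluation reported in the abstract.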
format Online
Article
Text
id pubmed-10612403
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10612403 2023-10-29 Predicting enzymatic function of protein sequences with attention Buton, Nicolas Coste, François Le Cunff, Yann Bioinformatics Original Paper MOTIVATION: There is a growing number of available protein sequences, but only a small fraction has been manually annotated. For example, only 0.25% of all UniProtKB entries are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we show that a simple combination of attention maps is on par with, or better than, classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a maximum F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://gitlab.inria.fr/nbuton/tfpc and datasets at https://doi.org/10.5281/zenodo.7253910 Oxford University Press 2023-10-24 /pmc/articles/PMC10612403/ /pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Paper
Buton, Nicolas
Coste, François
Le Cunff, Yann
Predicting enzymatic function of protein sequences with attention
title Predicting enzymatic function of protein sequences with attention
title_full Predicting enzymatic function of protein sequences with attention
title_fullStr Predicting enzymatic function of protein sequences with attention
title_full_unstemmed Predicting enzymatic function of protein sequences with attention
title_short Predicting enzymatic function of protein sequences with attention
title_sort predicting enzymatic function of protein sequences with attention
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/
https://www.ncbi.nlm.nih.gov/pubmed/37874958
http://dx.doi.org/10.1093/bioinformatics/btad620
work_keys_str_mv AT butonnicolas predictingenzymaticfunctionofproteinsequenceswithattention
AT costefrancois predictingenzymaticfunctionofproteinsequenceswithattention
AT lecunffyann predictingenzymaticfunctionofproteinsequenceswithattention