Predicting enzymatic function of protein sequences with attention
MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.
Main Authors: | Buton, Nicolas; Coste, François; Le Cunff, Yann |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Original Paper |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/ https://www.ncbi.nlm.nih.gov/pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 |
_version_ | 1785128696784355328 |
author | Buton, Nicolas; Coste, François; Le Cunff, Yann
author_facet | Buton, Nicolas; Coste, François; Le Cunff, Yann
author_sort | Buton, Nicolas |
collection | PubMed |
description | MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate the prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that using a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910
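The abstract describes two technical ingredients: specializing (fine-tuning) a pretrained protein language model to classify sequences into EC classes, and combining attention maps to score residue importance. Below is a minimal, hypothetical sketch of that pipeline in Python, assuming a ProtBert backbone (Rostlab/prot_bert) and a plain layer/head-averaged, column-sum attention aggregation; the paper's actual backbone, EC label granularity, and attention-combination scheme are not specified in this record, so every such choice here is an assumption, not EnzBert's implementation.

```python
# Hypothetical sketch: EC-class prediction with a protein language model
# plus attention-map-based residue scoring. Backbone, label set, and
# aggregation scheme are assumptions; they are NOT taken from the paper.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "Rostlab/prot_bert"  # assumed backbone (public ProtBert checkpoint)
NUM_EC_CLASSES = 7                # level-1 EC classes (EC 1-7); deeper levels have many more

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_EC_CLASSES,
    output_attentions=True,       # needed to retrieve attention maps
)
model.eval()

# ProtBert expects residues separated by spaces.
sequence = " ".join("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# NOTE: the classification head is randomly initialized until fine-tuned on
# EC-labelled sequences, so this prediction only illustrates the API shape.
predicted_class = outputs.logits.argmax(dim=-1).item()

# One simple way to "combine attention maps": average over all layers and
# heads, then sum the attention each position receives (column sum).
attn = torch.stack(outputs.attentions)        # (layers, batch, heads, len, len)
avg_map = attn.mean(dim=(0, 2))[0]            # (len, len), layer/head-averaged
residue_scores = avg_map.sum(dim=0)           # attention received per token
print(predicted_class, residue_scores[1:-1])  # slice off [CLS]/[SEP] tokens
```

High-scoring positions under such an aggregation would be the candidate functional residues to compare against annotated catalytic sites, which is the kind of evaluation the abstract's F-Gain score summarizes.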
format | Online Article Text |
id | pubmed-10612403 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10612403 2023-10-29 Predicting enzymatic function of protein sequences with attention Buton, Nicolas Coste, François Le Cunff, Yann Bioinformatics Original Paper MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate the prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that using a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910 Oxford University Press 2023-10-24 /pmc/articles/PMC10612403/ /pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle | Original Paper; Buton, Nicolas; Coste, François; Le Cunff, Yann; Predicting enzymatic function of protein sequences with attention
title | Predicting enzymatic function of protein sequences with attention |
title_full | Predicting enzymatic function of protein sequences with attention |
title_fullStr | Predicting enzymatic function of protein sequences with attention |
title_full_unstemmed | Predicting enzymatic function of protein sequences with attention |
title_short | Predicting enzymatic function of protein sequences with attention |
title_sort | predicting enzymatic function of protein sequences with attention |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/ https://www.ncbi.nlm.nih.gov/pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 |
work_keys_str_mv | AT butonnicolas predictingenzymaticfunctionofproteinsequenceswithattention AT costefrancois predictingenzymaticfunctionofproteinsequenceswithattention AT lecunffyann predictingenzymaticfunctionofproteinsequenceswithattention |