Predicting enzymatic function of protein sequences with attention
MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes.
Main Authors: | Buton, Nicolas; Coste, François; Le Cunff, Yann |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Original Paper |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/ https://www.ncbi.nlm.nih.gov/pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 |
_version_ | 1785128696784355328 |
author | Buton, Nicolas; Coste, François; Le Cunff, Yann
author_facet | Buton, Nicolas; Coste, François; Le Cunff, Yann
author_sort | Buton, Nicolas |
collection | PubMed |
description | MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate the prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that using a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910
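The abstract describes two technical ingredients: specializing (fine-tuning) a pretrained protein language model to classify sequences into EC classes, and combining attention maps to score residue importance. Below is a minimal, hypothetical sketch of that pipeline in Python, assuming a ProtBert backbone (Rostlab/prot_bert) and a plain layer/head-averaged, column-sum attention aggregation; the paper's actual backbone, EC label granularity, and attention-combination scheme are not specified in this record, so every such choice here is an assumption, not EnzBert's implementation.

```python
# Hypothetical sketch: EC-class prediction with a protein language model
# plus attention-map-based residue scoring. Backbone, label set, and
# aggregation scheme are assumptions; they are NOT taken from the paper.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "Rostlab/prot_bert"  # assumed backbone (public ProtBert checkpoint)
NUM_EC_CLASSES = 7                # level-1 EC classes (EC 1-7); deeper levels have many more

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_EC_CLASSES,
    output_attentions=True,       # needed to retrieve attention maps
)
model.eval()

# ProtBert expects residues separated by spaces.
sequence = " ".join("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# NOTE: the classification head is randomly initialized until fine-tuned on
# EC-labelled sequences, so this prediction only illustrates the API shape.
predicted_class = outputs.logits.argmax(dim=-1).item()

# One simple way to "combine attention maps": average over all layers and
# heads, then sum the attention each position receives (column sum).
attn = torch.stack(outputs.attentions)        # (layers, batch, heads, len, len)
avg_map = attn.mean(dim=(0, 2))[0]            # (len, len), layer/head-averaged
residue_scores = avg_map.sum(dim=0)           # attention received per token
print(predicted_class, residue_scores[1:-1])  # slice off [CLS]/[SEP] tokens
```

High-scoring positions under such an aggregation would be the candidate functional residues to compare against annotated catalytic sites, which is the kind of evaluation the abstract's F-Gain score summarizes.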
format | Online Article Text |
id | pubmed-10612403 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10612403 2023-10-29 Predicting enzymatic function of protein sequences with attention Buton, Nicolas Coste, François Le Cunff, Yann Bioinformatics Original Paper MOTIVATION: There is a growing number of available protein sequences, but only a limited number have been manually annotated. For example, only 0.25% of all entries of UniProtKB are reviewed by human annotators. Further developing automatic tools to infer protein function from sequence alone can alleviate part of this gap. In this article, we investigate the potential of Transformer deep neural networks on a specific case of functional sequence annotation: the prediction of enzymatic classes. RESULTS: We show that our EnzBert transformer models, trained to predict Enzyme Commission (EC) numbers by specialization of a protein language model, outperform state-of-the-art tools for monofunctional enzyme class prediction based on sequences only. Accuracy is improved from 84% to 95% on the prediction of EC numbers at level two on the EC40 benchmark. To evaluate the prediction quality at level four, the most detailed level of EC numbers, we built two new time-based benchmarks for comparison with the state-of-the-art methods ECPred and DeepEC: the macro-F1 score is improved from 41% to 54% and from 20% to 26%, respectively. Finally, we also show that using a simple combination of attention maps is on par with, or better than, other classical interpretability methods on the EC prediction task. More specifically, important residues identified by attention maps tend to correspond to known catalytic sites. Quantitatively, we report a max F-Gain score of 96.05%, while classical interpretability methods reach 91.44% at best. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are respectively available at https://gitlab.inria.fr/nbuton/tfpc and https://doi.org/10.5281/zenodo.7253910 Oxford University Press 2023-10-24 /pmc/articles/PMC10612403/ /pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle | Original Paper; Buton, Nicolas; Coste, François; Le Cunff, Yann; Predicting enzymatic function of protein sequences with attention
title | Predicting enzymatic function of protein sequences with attention |
title_full | Predicting enzymatic function of protein sequences with attention |
title_fullStr | Predicting enzymatic function of protein sequences with attention |
title_full_unstemmed | Predicting enzymatic function of protein sequences with attention |
title_short | Predicting enzymatic function of protein sequences with attention |
title_sort | predicting enzymatic function of protein sequences with attention |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612403/ https://www.ncbi.nlm.nih.gov/pubmed/37874958 http://dx.doi.org/10.1093/bioinformatics/btad620 |
work_keys_str_mv | AT butonnicolas predictingenzymaticfunctionofproteinsequenceswithattention AT costefrancois predictingenzymaticfunctionofproteinsequenceswithattention AT lecunffyann predictingenzymaticfunctionofproteinsequenceswithattention |