Different Recognition of Protein Features Depending on Deep Learning Models: A Case Study of Aromatic Decarboxylase UbiD
Main authors: | , , , , , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10295083/ https://www.ncbi.nlm.nih.gov/pubmed/37372080 http://dx.doi.org/10.3390/biology12060795 |
Summary: | SIMPLE SUMMARY: Many protein sequences are registered in biological databases, and hundreds more have recently been determined by next-generation sequencing, so the number of sequences with unknown functions is increasing explosively. Annotating them efficiently requires new ways of extracting features from protein sequences that go beyond existing knowledge. Deep learning can extract various features from training data. Many studies have reported deep learning models that predict protein annotations with high accuracy; however, these reports have not compared, across multiple deep learning models, which amino acid sites in a protein are important for the prediction. Here, three deep learning models that predict whether a protein belongs to a protein family were analyzed with an explainable artificial intelligence method to explore important protein features. Each model regarded different sites as important, and all models recognized as important features amino acids that differ from known secondary structures, conserved regions, and active sites. These results suggest that the models can interpret protein sequences from perspectives different from existing knowledge. ABSTRACT: The number of unannotated protein sequences is increasing explosively owing to genome sequencing technology. A more comprehensive understanding of protein functions for protein annotation requires the discovery of new features that conventional methods cannot capture. Deep learning can extract important features from input data and predict protein functions based on those features. Here, protein feature vectors generated by three deep learning models are analyzed using Integrated Gradients to explore important features of amino acid sites. As a case study, prediction and feature extraction models for UbiD enzymes were built using these models.
The important amino acid residues extracted from the models differed from the secondary structures, conserved regions, and active sites known for UbiD. Interestingly, different amino acid residues within UbiD sequences were regarded as important depending on the type of model and the sequence. The Transformer models focused on more specific regions than the other models. These results suggest that each deep learning model captures protein features from aspects different from existing knowledge and has the potential to discover new principles of protein function. This study will help to extract new protein features for other protein annotations. |
---|
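The abstract names Integrated Gradients as the method used to score amino acid sites. The sketch below illustrates the general technique only; it is not the authors' code or models. It assumes a toy sigmoid classifier with made-up weights (`toy_weights`), a hypothetical six-residue sequence, a one-hot input encoding, and an all-zero baseline. Attributions are computed as (input minus baseline) times the path-averaged gradient, approximated with a midpoint Riemann sum.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]


def toy_weights(length):
    """Deterministic, made-up weights standing in for a trained model (assumption)."""
    n = len(AMINO_ACIDS)
    return [[math.sin(1.0 + i * n + j) for j in range(n)] for i in range(length)]


def predict(x, weights):
    """Toy classifier: sigmoid of a weighted sum over all positions and residues."""
    z = sum(w * v for w_row, x_row in zip(weights, x) for w, v in zip(w_row, x_row))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x, weights):
    """Analytic gradient of predict() with respect to every input entry."""
    p = predict(x, weights)
    return [[p * (1.0 - p) * w for w in w_row] for w_row in weights]


def integrated_gradients(x, baseline, weights, steps=200):
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    total = [[0.0] * len(row) for row in x]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [[b + alpha * (v - b) for v, b in zip(x_row, b_row)]
                 for x_row, b_row in zip(x, baseline)]
        grad = gradient(point, weights)
        for i, g_row in enumerate(grad):
            for j, g in enumerate(g_row):
                total[i][j] += g / steps  # running average of the gradient
    # Scale the averaged gradient by the input difference.
    return [[(v - b) * t for v, b, t in zip(x_row, b_row, t_row)]
            for x_row, b_row, t_row in zip(x, baseline, total)]


def per_residue_importance(attributions):
    """Collapse the amino acid channel to get one score per sequence position."""
    return [sum(row) for row in attributions]


sequence = "MKLAGF"  # hypothetical sequence, not taken from the paper
x = one_hot(sequence)
baseline = [[0.0] * len(AMINO_ACIDS) for _ in sequence]
weights = toy_weights(len(sequence))
attributions = integrated_gradients(x, baseline, weights)
importance = per_residue_importance(attributions)
for residue, score in zip(sequence, importance):
    print(f"{residue}: {score:+.4f}")
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the per-residue scores should sum, up to numerical error, to the difference between the model output for the input and for the baseline. With a real network, the analytic `gradient` function would be replaced by automatic differentiation, and the choice of baseline would need to be justified for the model at hand.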