
Survey of Protein Sequence Embedding Models

Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which vary widely in length and amino acid composition, as fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b...

Full description

Bibliographic Details
Main Authors: Tran, Chau, Khadkikar, Siddharth, Porollo, Aleksey
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9963412/
https://www.ncbi.nlm.nih.gov/pubmed/36835188
http://dx.doi.org/10.3390/ijms24043775
_version_ 1784896247286464512
author Tran, Chau
Khadkikar, Siddharth
Porollo, Aleksey
author_facet Tran, Chau
Khadkikar, Siddharth
Porollo, Aleksey
author_sort Tran, Chau
collection PubMed
description Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which vary widely in length and amino acid composition, as fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), on the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Fewer than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants show low to no correlation with minimal inhibitory concentrations (MICs).
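The variant comparison described above reduces to one operation: embed the reference and mutant sequences with the same model, then score their proximity with cosine similarity. A minimal sketch in plain Python; the vectors here are illustrative placeholders, not real model outputs (any of the surveyed models, e.g. Esm or ProtT5, would supply the actual fixed-size embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two fixed-size embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder embeddings (a real model would return vectors of length ~1024+).
reference = [0.12, -0.45, 0.33, 0.08]   # e.g., embedding of wild-type TEM-1
mutant    = [0.10, -0.40, 0.35, 0.05]   # embedding of a single-point mutant

score = cosine_similarity(reference, mutant)  # close to 1.0 for similar sequences
```

The survey compares distributions of such scores (benign vs. pathogenic variants against their reference proteins) rather than interpreting any single score in isolation.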
format Online
Article
Text
id pubmed-9963412
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9963412 2023-02-26 Survey of Protein Sequence Embedding Models Tran, Chau Khadkikar, Siddharth Porollo, Aleksey Int J Mol Sci Article MDPI 2023-02-14 /pmc/articles/PMC9963412/ /pubmed/36835188 http://dx.doi.org/10.3390/ijms24043775 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Tran, Chau
Khadkikar, Siddharth
Porollo, Aleksey
Survey of Protein Sequence Embedding Models
title Survey of Protein Sequence Embedding Models
title_full Survey of Protein Sequence Embedding Models
title_fullStr Survey of Protein Sequence Embedding Models
title_full_unstemmed Survey of Protein Sequence Embedding Models
title_short Survey of Protein Sequence Embedding Models
title_sort survey of protein sequence embedding models
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9963412/
https://www.ncbi.nlm.nih.gov/pubmed/36835188
http://dx.doi.org/10.3390/ijms24043775
work_keys_str_mv AT tranchau surveyofproteinsequenceembeddingmodels
AT khadkikarsiddharth surveyofproteinsequenceembeddingmodels
AT porolloaleksey surveyofproteinsequenceembeddingmodels