Survey of Protein Sequence Embedding Models
Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
Main Authors: | Tran, Chau; Khadkikar, Siddharth; Porollo, Aleksey |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9963412/ https://www.ncbi.nlm.nih.gov/pubmed/36835188 http://dx.doi.org/10.3390/ijms24043775 |
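The abstract describes comparing fixed-size embeddings of mutant and reference proteins by cosine similarity. As a minimal illustration of that comparison (using short toy vectors in place of real model output, which is typically ~1024-dimensional for models like ProtT5), the score can be computed as:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for protein language model output;
# a mutant close to its reference yields a score near 1.0.
reference = [0.2, 0.7, -0.1, 0.4]
mutant = [0.2, 0.6, -0.1, 0.5]

score = cosine_similarity(reference, mutant)
```

The vectors and dimensionality here are hypothetical; the paper applies this kind of scoring to embeddings of benign versus pathogenic human protein variants relative to their reference sequences.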
_version_ | 1784896247286464512 |
---|---|
author | Tran, Chau; Khadkikar, Siddharth; Porollo, Aleksey |
author_facet | Tran, Chau; Khadkikar, Siddharth; Porollo, Aleksey |
author_sort | Tran, Chau |
collection | PubMed |
description | Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC). |
format | Online Article Text |
id | pubmed-9963412 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-99634122023-02-26 Survey of Protein Sequence Embedding Models Tran, Chau Khadkikar, Siddharth Porollo, Aleksey Int J Mol Sci Article Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC). MDPI 2023-02-14 /pmc/articles/PMC9963412/ /pubmed/36835188 http://dx.doi.org/10.3390/ijms24043775 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Tran, Chau Khadkikar, Siddharth Porollo, Aleksey Survey of Protein Sequence Embedding Models |
title | Survey of Protein Sequence Embedding Models |
title_full | Survey of Protein Sequence Embedding Models |
title_fullStr | Survey of Protein Sequence Embedding Models |
title_full_unstemmed | Survey of Protein Sequence Embedding Models |
title_short | Survey of Protein Sequence Embedding Models |
title_sort | survey of protein sequence embedding models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9963412/ https://www.ncbi.nlm.nih.gov/pubmed/36835188 http://dx.doi.org/10.3390/ijms24043775 |
work_keys_str_mv | AT tranchau surveyofproteinsequenceembeddingmodels AT khadkikarsiddharth surveyofproteinsequenceembeddingmodels AT porolloaleksey surveyofproteinsequenceembeddingmodels |