Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries

Bibliographic Details
Main authors: Lopez-Martinez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L.
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10594869/
https://www.ncbi.nlm.nih.gov/pubmed/37703075
http://dx.doi.org/10.1021/acssynbio.3c00201
author Lopez-Martinez, Elena
Manteca, Aitor
Ferruz, Noelia
Cortajarena, Aitziber L.
collection PubMed
description Epitopes are specific regions on an antigen’s surface that the immune system recognizes. Epitopes are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, and in some cases, endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging with conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
format Online
Article
Text
id pubmed-10594869
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-10594869 2023-10-25 Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries. Lopez-Martinez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L. ACS Synth Biol. American Chemical Society 2023-09-13 /pmc/articles/PMC10594869/ /pubmed/37703075 http://dx.doi.org/10.1021/acssynbio.3c00201 Text en © 2023 The Authors. Published by American Chemical Society. Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).
title Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10594869/
https://www.ncbi.nlm.nih.gov/pubmed/37703075
http://dx.doi.org/10.1021/acssynbio.3c00201