Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries
Main authors: | Lopez-Martinez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Chemical Society, 2023 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10594869/ https://www.ncbi.nlm.nih.gov/pubmed/37703075 http://dx.doi.org/10.1021/acssynbio.3c00201 |
_version_ | 1785124744166637568 |
---|---|
author | Lopez-Martinez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L. |
author_sort | Lopez-Martinez, Elena |
collection | PubMed |
description | Epitopes are specific regions on an antigen’s surface that the immune system recognizes. They are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, although in some cases endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging with conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude. |
format | Online Article Text |
id | pubmed-10594869 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Chemical Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-10594869 2023-10-25 Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries. Lopez-Martinez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L. ACS Synth Biol. American Chemical Society 2023-09-13 /pmc/articles/PMC10594869/ /pubmed/37703075 http://dx.doi.org/10.1021/acssynbio.3c00201 Text en © 2023 The Authors. Published by American Chemical Society. https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/). |
title | Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries |
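
The abstract names two analyses: comparing amino acid frequencies and applying byte pair encoding (BPE) tokenization to linear epitope sequences. The following is a minimal sketch of both ideas on a toy input, not the authors' pipeline. The function names (`residue_frequencies`, `merge_pair`, `learn_bpe`) and the three example peptides are hypothetical and chosen only for illustration; real input would be linear epitope sequences exported from the Immune Epitope Database.

```python
from collections import Counter


def residue_frequencies(peptides):
    """Return the relative frequency of each residue across all peptides."""
    counts = Counter("".join(peptides))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(peptides, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single residues."""
    corpus = [list(p) for p in peptides]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically so the
        # result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


# Hypothetical toy peptides, not taken from any database.
toy_peptides = ["WYWF", "WYAK", "GWYW"]
frequencies = residue_frequencies(toy_peptides)
merges, tokenized = learn_bpe(toy_peptides, num_merges=2)
print(frequencies["W"])
print(merges)
print(tokenized)
```

In this sketch, residues that co-occur often are merged first, so frequent motifs surface as longer tokens. Comparing `residue_frequencies` for an epitope set against a background proteome would be the analogous step for the frequency analysis the abstract describes.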