
Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ(N) Using Natural Language Processing

Bibliographic Details
Main Authors: Ostrovsky-Berman, Miri; Frankel, Boaz; Polak, Pazit; Yaari, Gur
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Immunology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8340020/
https://www.ncbi.nlm.nih.gov/pubmed/34367141
http://dx.doi.org/10.3389/fimmu.2021.680687
author Ostrovsky-Berman, Miri
Frankel, Boaz
Polak, Pazit
Yaari, Gur
collection PubMed
description The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec on amino acid 3-gram sequences, continuing to longer BCR sequences, and finally to entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
format Online
Article
Text
id pubmed-8340020
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8340020 2021-08-06 Front Immunol Immunology Frontiers Media S.A. 2021-07-22 /pmc/articles/PMC8340020/ /pubmed/34367141 http://dx.doi.org/10.3389/fimmu.2021.680687 Text en Copyright © 2021 Ostrovsky-Berman, Frankel, Polak and Yaari https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ(N) Using Natural Language Processing
topic Immunology