
Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction

The repertoire of T cell receptors encodes various types of immunological information. Machine learning is indispensable for decoding such information from repertoire datasets measured by next-generation sequencing (NGS). In particular, the classification of repertoires is the most basic task, which...


Bibliographic Details
Main authors: Katayama, Yotaro, Kobayashi, Tetsuya J.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9346074/
https://www.ncbi.nlm.nih.gov/pubmed/35936014
http://dx.doi.org/10.3389/fimmu.2022.797640
_version_ 1784761566388813824
author Katayama, Yotaro
Kobayashi, Tetsuya J.
author_facet Katayama, Yotaro
Kobayashi, Tetsuya J.
author_sort Katayama, Yotaro
collection PubMed
description The repertoire of T cell receptors encodes various types of immunological information. Machine learning is indispensable for decoding such information from repertoire datasets measured by next-generation sequencing (NGS). In particular, the classification of repertoires is the most basic task, which is relevant for a variety of scientific and clinical problems. Supported by the recent appearance of large datasets, efficient but data-expensive methods have been proposed. However, it is unclear whether they can work efficiently when the available sample size is severely restricted as in practical situations. In this study, we demonstrate that their performances can be impaired substantially below critical sample sizes. To complement this drawback, we propose MotifBoost, which exploits the information of short k-mer motifs of TCRs. MotifBoost can perform the classification as efficiently as a deep learning method on large datasets while providing more stable and reliable results on small datasets. We tested MotifBoost on the four small datasets which consist of various conditions such as Cytomegalovirus (CMV), HIV, α-chain, β-chain and it consistently preserved the stability. We also clarify that the robustness of MotifBoost can be attributed to the efficiency of k-mer motifs as representation features of repertoires. Finally, by comparing the predictions of these methods, we show that the whole sequence identity and sequence motifs encode partially different information and that a combination of such complementary information is necessary for further development of repertoire analysis.
format Online
Article
Text
id pubmed-9346074
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9346074 2022-08-04 Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction Katayama, Yotaro Kobayashi, Tetsuya J. Front Immunol Immunology Frontiers Media S.A. 2022-07-20 /pmc/articles/PMC9346074/ /pubmed/35936014 http://dx.doi.org/10.3389/fimmu.2022.797640 Text en
Copyright © 2022 Katayama and Kobayashi https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Immunology
Katayama, Yotaro
Kobayashi, Tetsuya J.
Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title_full Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title_fullStr Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title_full_unstemmed Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title_short Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction
title_sort comparative study of repertoire classification methods reveals data efficiency of k-mer feature extraction
topic Immunology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9346074/
https://www.ncbi.nlm.nih.gov/pubmed/35936014
http://dx.doi.org/10.3389/fimmu.2022.797640