Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification

BACKGROUND: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penali...

Bibliographic Details
Main authors: Kanduri, Chakravarthi; Pavlović, Milena; Scheffer, Lonneke; Motwani, Keshav; Chernigovskaya, Maria; Greiff, Victor; Sandve, Geir K
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9154052/
https://www.ncbi.nlm.nih.gov/pubmed/35639633
http://dx.doi.org/10.1093/gigascience/giac046
author Kanduri, Chakravarthi
Pavlović, Milena
Scheffer, Lonneke
Motwani, Keshav
Chernigovskaya, Maria
Greiff, Victor
Sandve, Geir K
author_sort Kanduri, Chakravarthi
collection PubMed
description BACKGROUND: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where method development of more sophisticated ML approaches may be required. RESULTS: To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture-associated and immune state–associated sequence patterns (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences. CONCLUSIONS: We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.
format Online
Article
Text
id pubmed-9154052
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9154052 2022-06-04 Gigascience Research Oxford University Press 2022-05-25 /pmc/articles/PMC9154052/ /pubmed/35639633 http://dx.doi.org/10.1093/gigascience/giac046 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
topic Research