Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification

BACKGROUND: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penali...

Bibliographic Details
Main Authors:	Kanduri, Chakravarthi; Pavlović, Milena; Scheffer, Lonneke; Motwani, Keshav; Chernigovskaya, Maria; Greiff, Victor; Sandve, Geir K
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9154052/ https://www.ncbi.nlm.nih.gov/pubmed/35639633 http://dx.doi.org/10.1093/gigascience/giac046

author	Kanduri, Chakravarthi Pavlović, Milena Scheffer, Lonneke Motwani, Keshav Chernigovskaya, Maria Greiff, Victor Sandve, Geir K
collection	PubMed
description	BACKGROUND: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where method development of more sophisticated ML approaches may be required. RESULTS: To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture-associated and immune state–associated sequence patterns (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences. CONCLUSIONS: We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.
format	Online Article Text
id	pubmed-9154052
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9154052 2022-06-04 Gigascience Research Oxford University Press 2022-05-25 /pmc/articles/PMC9154052/ /pubmed/35639633 http://dx.doi.org/10.1093/gigascience/giac046 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9154052/ https://www.ncbi.nlm.nih.gov/pubmed/35639633 http://dx.doi.org/10.1093/gigascience/giac046
