
simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods

BACKGROUND: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing app...


Bibliographic Details
Main authors:	Kanduri, Chakravarthi; Scheffer, Lonneke; Pavlović, Milena; Rand, Knut Dagestad; Chernigovskaya, Maria; Pirvandy, Oz; Yaari, Gur; Greiff, Victor; Sandve, Geir K
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Technical Note
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10580376/ https://www.ncbi.nlm.nih.gov/pubmed/37848619 http://dx.doi.org/10.1093/gigascience/giad074

author	Kanduri, Chakravarthi; Scheffer, Lonneke; Pavlović, Milena; Rand, Knut Dagestad; Chernigovskaya, Maria; Pirvandy, Oz; Yaari, Gur; Greiff, Victor; Sandve, Geir K
author_sort	Kanduri, Chakravarthi
collection	PubMed
description	BACKGROUND: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing the key feature of many shared receptor sequences (selected for common antigens) found in antigen-experienced repertoires. RESULTS: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the possibility of making or not making any prior assumptions regarding the similarity or commonality of immune state–associated sequences that will be used as true signals. We demonstrate the real-world realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets. CONCLUSIONS: This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.
format	Online Article Text
id	pubmed-10580376
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10580376 2023-10-18 simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods Kanduri, Chakravarthi; Scheffer, Lonneke; Pavlović, Milena; Rand, Knut Dagestad; Chernigovskaya, Maria; Pirvandy, Oz; Yaari, Gur; Greiff, Victor; Sandve, Geir K GigaScience Technical Note Oxford University Press 2023-10-17 /pmc/articles/PMC10580376/ /pubmed/37848619 http://dx.doi.org/10.1093/gigascience/giad074 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10580376/ https://www.ncbi.nlm.nih.gov/pubmed/37848619 http://dx.doi.org/10.1093/gigascience/giad074
