AntiRef: reference clusters of human antibody sequences

MOTIVATION: Genetic biases in the human antibody repertoire result in publicly available antibody sequence datasets that contain many duplicate or highly similar sequences. Available datasets are further skewed by the predominance of studies focused on specific disease states, primarily cancer, auto...

Bibliographic Details
Main Author:	Briney, Bryan
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Application Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598580/ https://www.ncbi.nlm.nih.gov/pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109

author	Briney, Bryan
collection	PubMed
description	MOTIVATION: Genetic biases in the human antibody repertoire result in publicly available antibody sequence datasets that contain many duplicate or highly similar sequences. Available datasets are further skewed by the predominance of studies focused on specific disease states, primarily cancer, autoimmunity, and a small number of infectious diseases that includes HIV, influenza, and SARS-CoV-2. These biases and redundancies are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine-learning models. Identity-based clustering provides a solution; however, the extremely large size of available antibody sequence datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data. RESULTS: Antibody Reference Clusters (AntiRef), which is modeled after UniRef, provides clustered datasets of filtered human antibody sequences. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef for general protein sequences are suboptimal for antibody clustering. Starting with an input dataset of ∼451M full-length, productive human antibody sequences, AntiRef provides reference datasets clustered at a range of antibody-optimized identity thresholds. AntiRef90 is one-third the size of the input dataset and less than half the size of the non-redundant AntiRef100. AVAILABILITY AND IMPLEMENTATION: AntiRef datasets are available on Zenodo (zenodo.org/record/7474336). All code used to generate AntiRef is available on GitHub (github.com/briney/antiref). The AntiRef versioning scheme (current version: v2022.12.14) refers to the date on which sequences were retrieved from OAS.
format	Online Article Text
id	pubmed-10598580
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10598580 2023-10-26 AntiRef: reference clusters of human antibody sequences Briney, Bryan Bioinform Adv Application Note
Oxford University Press 2023-08-22 /pmc/articles/PMC10598580/ /pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	AntiRef: reference clusters of human antibody sequences
topic	Application Note
