
AntiRef: reference clusters of human antibody sequences



Bibliographic Details
Main Author: Briney, Bryan
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598580/
https://www.ncbi.nlm.nih.gov/pubmed/37886711
http://dx.doi.org/10.1093/bioadv/vbad109
_version_ 1785125585967644672
author Briney, Bryan
collection PubMed
description MOTIVATION: Genetic biases in the human antibody repertoire result in publicly available antibody sequence datasets that contain many duplicate or highly similar sequences. Available datasets are further skewed by the predominance of studies focused on specific disease states, primarily cancer, autoimmunity, and a small number of infectious diseases that includes HIV, influenza, and SARS-CoV-2. These biases and redundancies are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine-learning models. Identity-based clustering provides a solution; however, the extremely large size of available antibody sequence datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data. RESULTS: Antibody Reference Clusters (AntiRef), which is modeled after UniRef, provides clustered datasets of filtered human antibody sequences. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef for general protein sequences are suboptimal for antibody clustering. Starting with an input dataset of ∼451M full-length, productive human antibody sequences, AntiRef provides reference datasets clustered at a range of antibody-optimized identity thresholds. AntiRef90 is one-third the size of the input dataset and less than half the size of the non-redundant AntiRef100. AVAILABILITY AND IMPLEMENTATION: AntiRef datasets are available on Zenodo (zenodo.org/record/7474336). All code used to generate AntiRef is available on GitHub (github.com/briney/antiref). The AntiRef versioning scheme (current version: v2022.12.14) refers to the date on which sequences were retrieved from OAS.
format Online
Article
Text
id pubmed-10598580
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10598580 2023-10-26 AntiRef: reference clusters of human antibody sequences Briney, Bryan Bioinform Adv Application Note
Oxford University Press 2023-08-22 /pmc/articles/PMC10598580/ /pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Application Note
Briney, Bryan
AntiRef: reference clusters of human antibody sequences
title AntiRef: reference clusters of human antibody sequences
topic Application Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598580/
https://www.ncbi.nlm.nih.gov/pubmed/37886711
http://dx.doi.org/10.1093/bioadv/vbad109
work_keys_str_mv AT brineybryan antirefreferenceclustersofhumanantibodysequences