AntiRef: reference clusters of human antibody sequences
Main author: | Briney, Bryan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Application Note |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598580/ https://www.ncbi.nlm.nih.gov/pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109 |
_version_ | 1785125585967644672 |
---|---|
author | Briney, Bryan |
author_facet | Briney, Bryan |
author_sort | Briney, Bryan |
collection | PubMed |
description | MOTIVATION: Genetic biases in the human antibody repertoire result in publicly available antibody sequence datasets that contain many duplicate or highly similar sequences. Available datasets are further skewed by the predominance of studies focused on specific disease states, primarily cancer, autoimmunity, and a small number of infectious diseases that includes HIV, influenza, and SARS-CoV-2. These biases and redundancies are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine-learning models. Identity-based clustering provides a solution; however, the extremely large size of available antibody sequence datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data. RESULTS: Antibody Reference Clusters (AntiRef), which is modeled after UniRef, provides clustered datasets of filtered human antibody sequences. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef for general protein sequences are suboptimal for antibody clustering. Starting with an input dataset of ∼451M full-length, productive human antibody sequences, AntiRef provides reference datasets clustered at a range of antibody-optimized identity thresholds. AntiRef90 is one-third the size of the input dataset and less than half the size of the non-redundant AntiRef100. AVAILABILITY AND IMPLEMENTATION: AntiRef datasets are available on Zenodo (zenodo.org/record/7474336). All code used to generate AntiRef is available on GitHub (github.com/briney/antiref). The AntiRef versioning scheme (current version: v2022.12.14) refers to the date on which sequences were retrieved from OAS. |
format | Online Article Text |
id | pubmed-10598580 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10598580 2023-10-26 AntiRef: reference clusters of human antibody sequences Briney, Bryan Bioinform Adv Application Note Oxford University Press 2023-08-22 /pmc/articles/PMC10598580/ /pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Application Note Briney, Bryan AntiRef: reference clusters of human antibody sequences |
title | AntiRef: reference clusters of human antibody sequences |
title_full | AntiRef: reference clusters of human antibody sequences |
title_fullStr | AntiRef: reference clusters of human antibody sequences |
title_full_unstemmed | AntiRef: reference clusters of human antibody sequences |
title_short | AntiRef: reference clusters of human antibody sequences |
title_sort | antiref: reference clusters of human antibody sequences |
topic | Application Note |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598580/ https://www.ncbi.nlm.nih.gov/pubmed/37886711 http://dx.doi.org/10.1093/bioadv/vbad109 |
work_keys_str_mv | AT brineybryan antirefreferenceclustersofhumanantibodysequences |
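
The abstract in this record describes identity-based clustering at several thresholds (for example, the non-redundant AntiRef100 and the smaller AntiRef90). The sketch below illustrates only the general idea with a toy greedy clustering in Python. It is not the AntiRef pipeline, whose code is at github.com/briney/antiref. The function names, the unaligned position-by-position identity measure, and the short sequences are all assumptions made up for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, normalized by the longer sequence.

    Toy measure (an assumption for this sketch): it compares position by
    position without any alignment, so it is only meaningful for sequences
    of similar length.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering.

    Each sequence joins the first existing representative it matches at
    >= threshold; otherwise it becomes a new representative.
    Returns a mapping of representative -> cluster members.
    """
    clusters: Dict[str, List[str]] = {}
    # Process longest sequences first so representatives are the longest
    # members; sorted() is stable, so ties keep their input order.
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Made-up toy sequences, not real antibody data.
toy = [
    "CARDYYGSSYFDYW",
    "CARDYYGSSYFDYW",  # exact duplicate of the first sequence
    "CARDYYGSAYFDYW",  # one mismatch in 14 positions
    "CAKGGWELLRFDPW",  # mostly different
]

nr100 = greedy_cluster(toy, 1.0)  # collapses exact duplicates only
nr90 = greedy_cluster(toy, 0.9)   # also merges near-identical sequences

print(len(nr100), "clusters at 100% identity")
print(len(nr90), "clusters at 90% identity")
```

Lowering the threshold merges more sequences into each cluster, which is why a 90% set is smaller than a 100% set. This all-against-representatives loop scales poorly, so it would not be practical for the ∼451M sequences mentioned in the abstract.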