Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm

In jawed vertebrates, variable (V) genes code for antigen-binding regions of B and T lymphocyte receptors, which generate a specific response to foreign pathogens. Obtaining the detailed repertoire of these genes across jawed vertebrates would help to understand their evolution and functi...

Bibliographic Details
Main authors: Olivieri, David N., Gambón-Deza, Francisco
Format: Online Article Text
Language: English
Published: Hindawi 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6388353/
https://www.ncbi.nlm.nih.gov/pubmed/30886642
http://dx.doi.org/10.1155/2019/3780245
_version_ 1783397751922360320
author Olivieri, David N.
Gambón-Deza, Francisco
author_facet Olivieri, David N.
Gambón-Deza, Francisco
author_sort Olivieri, David N.
collection PubMed
description In jawed vertebrates, variable (V) genes code for antigen-binding regions of B and T lymphocyte receptors, which generate a specific response to foreign pathogens. Obtaining the detailed repertoire of these genes across jawed vertebrates would help to understand their evolution and function. However, annotations of V-genes are known for only a few model species since their extraction is not amenable to standard gene-finding algorithms. Also, the more evolutionarily distant a taxon is from such model species, the less homology there is between their V-gene sequences. Here, we present an iterative supervised machine learning algorithm that begins by training on a small set of known and verified V-gene sequences. The algorithm successively discovers homologous unaligned V-exons from a larger set of whole genome shotgun (WGS) datasets from many taxa. Upon each iteration, newly uncovered V-genes are added to the training set for the next predictions. This iterative learning/discovery process terminates when the number of new sequences discovered is negligible. This process is akin to “online” or reinforcement learning and proves useful for discovering homologous V-genes from taxa successively more distant from the original set. Results are demonstrated for 14 primate WGS datasets and validated against Ensembl annotations. This algorithm is implemented in the Python programming language and is freely available at http://vgenerepertoire.org.
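The iterative loop described in the abstract (train on a seed set, predict, add new hits to the training set, stop when few new sequences appear) can be illustrated with a toy sketch. This is not the authors' implementation: the k-mer cosine similarity below is only a stand-in for their trained classifier, and every name and parameter (`kmer_profile`, `cosine`, `iterative_discovery`, `threshold`, `min_new`, `max_iter`) is hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count the overlapping k-mers of a sequence (toy feature vector)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def iterative_discovery(seed, candidates, threshold=0.4, min_new=1, max_iter=20):
    """Grow the training set from `seed` until fewer than `min_new` hits appear.

    Assumption: a candidate is accepted when its best similarity to any
    training sequence reaches `threshold`; the paper uses a supervised
    classifier instead of this similarity rule.
    """
    training = list(seed)
    remaining = list(candidates)
    for _ in range(max_iter):
        profiles = [kmer_profile(s) for s in training]
        accepted = [
            c for c in remaining
            if max(cosine(kmer_profile(c), p) for p in profiles) >= threshold
        ]
        if len(accepted) < min_new:
            break  # discovery has saturated
        training.extend(accepted)  # new hits feed the next iteration
        remaining = [c for c in remaining if c not in accepted]
    return training[len(seed):]


seed = ["AAAAAACCCCCC"]
candidates = ["GGGGGGTTTTTT", "CCCCCCGGGGGG", "ATATATATATAT"]
discovered = iterative_discovery(seed, candidates)
```

In this toy run the second candidate resembles the seed, while the first resembles only the second, so it can be reached only after the training set has grown. That is the sense in which each iteration extends the search toward sequences more distant from the original set.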
format Online
Article
Text
id pubmed-6388353
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-6388353 2019-03-18 Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm Olivieri, David N. Gambón-Deza, Francisco Comput Math Methods Med Research Article In jawed vertebrates, variable (V) genes code for antigen-binding regions of B and T lymphocyte receptors, which generate a specific response to foreign pathogens. Obtaining the detailed repertoire of these genes across jawed vertebrates would help to understand their evolution and function. However, annotations of V-genes are known for only a few model species since their extraction is not amenable to standard gene-finding algorithms. Also, the more evolutionarily distant a taxon is from such model species, the less homology there is between their V-gene sequences. Here, we present an iterative supervised machine learning algorithm that begins by training on a small set of known and verified V-gene sequences. The algorithm successively discovers homologous unaligned V-exons from a larger set of whole genome shotgun (WGS) datasets from many taxa. Upon each iteration, newly uncovered V-genes are added to the training set for the next predictions. This iterative learning/discovery process terminates when the number of new sequences discovered is negligible. This process is akin to “online” or reinforcement learning and proves useful for discovering homologous V-genes from taxa successively more distant from the original set. Results are demonstrated for 14 primate WGS datasets and validated against Ensembl annotations. This algorithm is implemented in the Python programming language and is freely available at http://vgenerepertoire.org. Hindawi 2019-02-11 /pmc/articles/PMC6388353/ /pubmed/30886642 http://dx.doi.org/10.1155/2019/3780245 Text en Copyright © 2019 David N. Olivieri and Francisco Gambón-Deza.
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Olivieri, David N.
Gambón-Deza, Francisco
Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title_full Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title_fullStr Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title_full_unstemmed Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title_short Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm
title_sort iterative variable gene discovery from whole genome sequencing with a bootstrapped multiresolution algorithm
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6388353/
https://www.ncbi.nlm.nih.gov/pubmed/30886642
http://dx.doi.org/10.1155/2019/3780245
work_keys_str_mv AT olivieridavidn iterativevariablegenediscoveryfromwholegenomesequencingwithabootstrappedmultiresolutionalgorithm
AT gambondezafrancisco iterativevariablegenediscoveryfromwholegenomesequencingwithabootstrappedmultiresolutionalgorithm