An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage

SIMPLE SUMMARY: Viral sequence variation can expand the host repertoire, enhance the infection ability, and/or prevent the build-up of a long-term specific immunity by the host. The study of viral diversity is, thus, critical to understand sequence change and its implications for intervention strate...

Bibliographic Details
Main authors: Chong, Li Chuin; Lim, Wei Lun; Ban, Kenneth Hon Kim; Khan, Asif M.
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8466476/
https://www.ncbi.nlm.nih.gov/pubmed/34571730
http://dx.doi.org/10.3390/biology10090853
author Chong, Li Chuin
Lim, Wei Lun
Ban, Kenneth Hon Kim
Khan, Asif M.
author_sort Chong, Li Chuin
collection PubMed
description SIMPLE SUMMARY: Viral sequence variation can expand the host repertoire, enhance the infection ability, and/or prevent the build-up of a long-term specific immunity by the host. The study of viral diversity is, thus, critical to understanding sequence change and its implications for intervention strategies. Typically, these studies are performed using alignment-dependent approaches. However, such approaches become limited as sequence diversity increases. Herein, we present an alignment-free algorithm, implemented as a publicly available tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomy lineage. UNIQmin enables the generation of a minimal set for a given sequence dataset of interest and is applicable to big data, with a reasonable time performance. The minimal set is the smallest possible number of unique sequences required to represent a given peptidome diversity (pool of distinct peptides of a specific length) exhibited by a non-redundant dataset. This compression is possible through the removal of unique sequences that do not contribute effectively to the peptidome diversity pool. The utility of UNIQmin was demonstrated for the species Dengue virus, genus Flavivirus, family Flaviviridae, and superkingdom Viruses. The concept of a minimal set is generic and thus possibly applicable to both genomic and proteomic data of non-viral, pathogenic microorganisms.
ABSTRACT: The study of viral diversity is imperative in understanding sequence change and its implications for intervention strategies. The widely used alignment-dependent approaches to study viral diversity are limited in their utility as sequence dissimilarity increases, particularly when expanded to the genus or higher ranks of viral species lineage. Herein, we present an alignment-independent algorithm, implemented as a tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomy lineage. This is done by performing an exhaustive search to generate the minimal set of sequences for a given viral non-redundant sequence dataset. The minimal set comprises the smallest possible number of unique sequences required to capture the diversity inherent in the complete set of overlapping k-mers encoded by all the unique sequences in the given dataset. Such dataset compression is possible through the removal of unique sequences whose entire repertoire of overlapping k-mers can be represented by other sequences, thus rendering them redundant to the collective pool of sequence diversity. A significant reduction, namely ~44%, ~45%, and ~53%, was observed for all reported unique sequences of species Dengue virus, genus Flavivirus, and family Flaviviridae, respectively, while still capturing the entire repertoire of nonamer (9-mer) viral peptidome diversity present in the initial input dataset. The algorithm is scalable for big data, as it was applied to ~2.2 million non-redundant sequences of all reported viruses. UNIQmin is open source and publicly available on GitHub. The concept of a minimal set is generic and, thus, potentially applicable to other pathogenic microorganisms of non-viral origin, such as bacteria.
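The minimal-set idea described in the abstract can be illustrated with a short Python sketch. This is not UNIQmin's implementation: it swaps in a simple greedy set-cover heuristic, which keeps every k-mer of the input but is not guaranteed to return the smallest possible set. The function names `kmers` and `greedy_minimal_set` and the toy sequences are hypothetical, chosen only for illustration.

```python
def kmers(seq, k=9):
    """Return the set of overlapping k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(sequences, k=9):
    """Greedy approximation of a minimal set (assumption: not UNIQmin's method).

    Repeatedly picks the sequence that covers the most k-mers not yet
    covered, until every k-mer in the input is represented.
    """
    unique = list(dict.fromkeys(sequences))      # drop duplicates, keep order
    kmer_sets = {s: kmers(s, k) for s in unique}
    uncovered = set().union(*kmer_sets.values())  # full k-mer pool to preserve
    chosen = []
    while uncovered:
        # Sequence contributing the most still-uncovered k-mers.
        best = max(unique, key=lambda s: len(kmer_sets[s] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen


# Toy example with k=3: every 3-mer of "BCDE" already occurs in "ABCDEF",
# so "BCDE" adds nothing to the k-mer pool and is left out.
toy = ["ABCDEF", "BCDE", "DEFGH"]
print(greedy_minimal_set(toy, k=3))  # ['ABCDEF', 'DEFGH']
```

For the nonamer peptidome analysis mentioned in the abstract, the same sketch would be called with `k=9` on protein sequences. The greedy choice keeps the code short; an exhaustive search, as the abstract describes, would be needed to guarantee minimality.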
format Online
Article
Text
id pubmed-8466476
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8466476 2021-09-27 An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage Chong, Li Chuin; Lim, Wei Lun; Ban, Kenneth Hon Kim; Khan, Asif M. Biology (Basel) Article MDPI 2021-08-31 /pmc/articles/PMC8466476/ /pubmed/34571730 http://dx.doi.org/10.3390/biology10090853 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage
title_sort alignment-independent approach for the study of viral sequence diversity at any given rank of taxonomy lineage
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8466476/
https://www.ncbi.nlm.nih.gov/pubmed/34571730
http://dx.doi.org/10.3390/biology10090853