
The complexity landscape of viral genomes

BACKGROUND: Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically...


Bibliographic Details
Main authors:	Silva, Jorge Miguel; Pratas, Diogo; Caetano, Tânia; Matos, Sérgio
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366995/ https://www.ncbi.nlm.nih.gov/pubmed/35950839 http://dx.doi.org/10.1093/gigascience/giac079

author	Silva, Jorge Miguel; Pratas, Diogo; Caetano, Tânia; Matos, Sérgio
author_facet	Silva, Jorge Miguel; Pratas, Diogo; Caetano, Tânia; Matos, Sérgio
author_sort	Silva, Jorge Miguel
collection	PubMed
description	BACKGROUND: Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, despite the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically illuminates the organization, relationships, and fundamental characteristics of viral genomes. RESULTS: This work provides a comprehensive landscape of the complexity (or quantity of information) of viral genomes, identifying the most redundant and the most complex groups with respect to their genome sequences while providing their distribution and characteristics at both large and local scales. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genome database, we show that double-stranded DNA viruses are, on average, the most redundant viruses, while single-stranded DNA viruses are the least redundant. In contrast, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing for the first time a direct complexity analysis of human herpesviruses. We also devise a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers.
CONCLUSIONS: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying the organization of viral genomes while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for understanding the characterization of viral genomes through dynamic and interactive approaches.
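The feature set described in the abstract (a compression-based complexity measure combined with GC-content percentage and sequence length) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the abstract does not name the genomic compressor used, so the general-purpose `zlib` compressor from the Python standard library stands in for it here, and the function names `normalized_compression`, `gc_content`, and `sequence_features` are hypothetical. A general-purpose compressor also does not model inverted repeats, so this sketch only approximates the idea.

```python
import zlib


def normalized_compression(seq: str) -> float:
    """Compressed size in bits divided by 2 bits per base (4-symbol alphabet).

    Assumption: zlib is a stand-in for a dedicated genomic compressor.
    Lower values indicate a more redundant sequence.
    """
    data = seq.upper().encode("ascii")
    compressed_bits = 8 * len(zlib.compress(data, 9))
    return compressed_bits / (2 * len(data))


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def sequence_features(seq: str) -> tuple[float, float, int]:
    """Feature vector: (normalized compression, GC percentage, length)."""
    return (normalized_compression(seq), gc_content(seq), len(seq))
```

Feature vectors of this kind could then be passed to any standard machine learning classifier; which classifier the study used is not stated in this record.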
format	Online Article Text
id	pubmed-9366995
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9366995 2022-08-12 The complexity landscape of viral genomes. Silva, Jorge Miguel; Pratas, Diogo; Caetano, Tânia; Matos, Sérgio. Gigascience. Research. Oxford University Press 2022-08-11 /pmc/articles/PMC9366995/ /pubmed/35950839 http://dx.doi.org/10.1093/gigascience/giac079 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	The complexity landscape of viral genomes
title_sort	complexity landscape of viral genomes
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366995/ https://www.ncbi.nlm.nih.gov/pubmed/35950839 http://dx.doi.org/10.1093/gigascience/giac079


Similar items