
An efficient numerical representation of genome sequence: natural vector with covariance component

BACKGROUND: The characterization and comparison of microbial sequences, including archaea, bacteria, viruses and fungi, are very important to understand their evolutionary origin and the population relationship. Most methods are limited by the sequence length and lack of generality. The purpose of t...


Bibliographic Details
Main Authors: Sun, Nan; Zhao, Xin; Yau, Stephen S.-T.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9206847/
https://www.ncbi.nlm.nih.gov/pubmed/35729905
http://dx.doi.org/10.7717/peerj.13544
author Sun, Nan
Zhao, Xin
Yau, Stephen S.-T.
collection PubMed
description BACKGROUND: The characterization and comparison of microbial sequences, including archaea, bacteria, viruses and fungi, are important for understanding their evolutionary origins and population relationships. Most methods are limited by sequence length and lack generality. The purpose of this study is to propose a general characterization method and to study the classification and phylogeny of existing datasets. METHODS: We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. The new numerical representation is used to study the classification and phylogenetic relationships of microbial sequences. RESULTS: First, the classification results validate that the six-dimensional covariance vector is necessary to characterize sequences. Then, the 18-dimensional natural vector is used to examine the similarity between giant viruses and archaea, bacteria, and other viruses. Nearest-distance calculations indicate that giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative giant-virus families, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten sequences of Mimiviridae cluster with Pandoraviridae, and that Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method is fast to compute and provides an effective numerical representation of microbial sequences.
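The description above names an 18-dimensional vector built from per-nucleotide statistics plus six pairwise covariance terms, but this record does not give the formulas. The Python sketch below is therefore only an illustration under stated assumptions: the first 12 components follow the commonly cited natural-vector layout (count, mean position, and second central moment scaled by n_k * n for each of A, C, G, T), and the covariance rule (pairing the j-th occurrence of one nucleotide with the j-th occurrence of the other, truncated to the shorter list) is a hypothetical stand-in, not the authors' definition. The function names are likewise illustrative.

```python
import math
from itertools import combinations

NUCLEOTIDES = "ACGT"


def natural_vector_with_covariance(seq):
    """Return an 18-element list: 4 counts, 4 mean positions,
    4 scaled second central moments, and 6 pairwise covariance terms."""
    seq = seq.upper()
    n = len(seq)

    # 1-based positions of each nucleotide; other symbols (e.g. N) are skipped.
    positions = {k: [] for k in NUCLEOTIDES}
    for i, base in enumerate(seq, start=1):
        if base in positions:
            positions[base].append(i)

    counts = [len(positions[k]) for k in NUCLEOTIDES]
    means = {k: (sum(p) / len(p) if p else 0.0) for k, p in positions.items()}

    # Second central moment of the positions, divided by n_k * n.
    moments = [
        sum((i - means[k]) ** 2 for i in positions[k]) / (len(positions[k]) * n)
        if positions[k] else 0.0
        for k in NUCLEOTIDES
    ]

    # ASSUMPTION (not from the record): pair the j-th occurrence of k with
    # the j-th occurrence of l, and divide by (number of pairs * n).
    covariances = []
    for k, l in combinations(NUCLEOTIDES, 2):
        pairs = list(zip(positions[k], positions[l]))
        if pairs:
            total = sum((a - means[k]) * (b - means[l]) for a, b in pairs)
            covariances.append(total / (len(pairs) * n))
        else:
            covariances.append(0.0)

    return counts + [means[k] for k in NUCLEOTIDES] + moments + covariances


def euclidean_distance(u, v):
    """Distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    v1 = natural_vector_with_covariance("ACGTACGT")
    v2 = natural_vector_with_covariance("AACCGGTT")
    print(len(v1), euclidean_distance(v1, v2))
```

Because every sequence maps to a fixed-length vector regardless of its length, two sequences can be compared with a single distance computation and no alignment, which is consistent with the speed claim in the description.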
format Online
Article
Text
id pubmed-9206847
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9206847 2022-06-20 An efficient numerical representation of genome sequence: natural vector with covariance component Sun, Nan; Zhao, Xin; Yau, Stephen S.-T. PeerJ Bioinformatics PeerJ Inc. 2022-06-16 /pmc/articles/PMC9206847/ /pubmed/35729905 http://dx.doi.org/10.7717/peerj.13544 Text en © 2022 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title An efficient numerical representation of genome sequence: natural vector with covariance component
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9206847/
https://www.ncbi.nlm.nih.gov/pubmed/35729905
http://dx.doi.org/10.7717/peerj.13544