Frequency Analysis Techniques for Identification of Viral Genetic Data

Environmental metagenomic samples and samples obtained as an attempt to identify a pathogen associated with the emergence of a novel infectious disease are important sources of novel microorganisms. The low costs and high throughput of sequencing technologies are expected to allow for the genetic ma...

Bibliographic Details
Main Authors: Trifonov, Vladimir, Rabadan, Raul
Format: Text
Language: English
Published: American Society of Microbiology 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2932508/
https://www.ncbi.nlm.nih.gov/pubmed/20824103
http://dx.doi.org/10.1128/mBio.00156-10
author Trifonov, Vladimir
Rabadan, Raul
author_facet Trifonov, Vladimir
Rabadan, Raul
author_sort Trifonov, Vladimir
collection PubMed
description Environmental metagenomic samples and samples obtained as an attempt to identify a pathogen associated with the emergence of a novel infectious disease are important sources of novel microorganisms. The low costs and high throughput of sequencing technologies are expected to allow for the genetic material in those samples to be sequenced and the genomes of the novel microorganisms to be identified by alignment to those in a database of known genomes. Yet, for various biological and technical reasons, such alignment might not always be possible. We investigate a frequency analysis technique which on one hand allows for the identification of genetic material without relying on alignment and on the other hand makes possible the discovery of nonoverlapping contigs from the same organism. The technique is based on obtaining signatures of the genetic data and defining a distance/similarity measure between signatures. More precisely, the signatures of the genetic data are the frequencies of k-mers occurring in them, with k being a natural number. We considered an entropy-based distance between signatures, similar to the Kullback-Leibler distance in information theory, and investigated its ability to categorize negative-sense single-stranded RNA (ssRNA) viral genetic data. Our conclusion is that in this viral context, the technique provides a viable way of discovering genetic relationships without relying on alignment. We envision that our approach will be applicable to other microbial genetic contexts, e.g., other types of viruses, and will be an important tool in the discovery of novel microorganisms.
format Text
id pubmed-2932508
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher American Society of Microbiology
record_format MEDLINE/PubMed
spelling pubmed-2932508 2010-09-03 Frequency Analysis Techniques for Identification of Viral Genetic Data Trifonov, Vladimir Rabadan, Raul mBio Research Article American Society of Microbiology 2010-08-24 /pmc/articles/PMC2932508/ /pubmed/20824103 http://dx.doi.org/10.1128/mBio.00156-10 Text en Copyright © 2010 Trifonov and Rabadan.
http://creativecommons.org/licenses/by-nc-sa/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/) , which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Frequency Analysis Techniques for Identification of Viral Genetic Data
title_sort frequency analysis techniques for identification of viral genetic data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2932508/
https://www.ncbi.nlm.nih.gov/pubmed/20824103
http://dx.doi.org/10.1128/mBio.00156-10
work_keys_str_mv AT trifonovvladimir frequencyanalysistechniquesforidentificationofviralgeneticdata
AT rabadanraul frequencyanalysistechniquesforidentificationofviralgeneticdata
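
The description in this record characterizes a signature as the frequencies of the k-mers occurring in a sequence, compared with an entropy-based distance similar to the Kullback-Leibler distance. The record does not give the exact formula, so the following is only a minimal sketch under stated assumptions: it uses a symmetrized Kullback-Leibler divergence as a stand-in for the authors' measure, adds a pseudocount so that no frequency is zero, and treats U as T. The function names `kmer_frequencies`, `kl_divergence`, and `symmetric_kl` are hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"


def kmer_frequencies(seq, k, pseudocount=1.0):
    """Return smoothed relative frequencies of all 4**k k-mers in seq.

    Assumption: a pseudocount is added to every k-mer so that the
    divergence below is always finite. K-mers containing characters
    outside ACGT (for example N) are ignored.
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[m] * log2(p[m] / q[m]) for m in p)


def symmetric_kl(p, q):
    """Symmetrized divergence; an assumed stand-in for the article's distance."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real viral data.
    a = kmer_frequencies("ACGUACGUACGUAACC", k=2)
    b = kmer_frequencies("AAAAAAUUUUUUAAAA", k=2)
    print(symmetric_kl(a, a))  # identical signatures give zero distance
    print(symmetric_kl(a, b))  # differing composition gives a positive distance
```

Because the signatures depend only on k-mer composition, two nonoverlapping contigs can be compared this way without any alignment between them, which is the property the description emphasizes. The choice of k and of the pseudocount here is arbitrary and would need tuning on real data.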