K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features

In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding posit...

Bibliographic Details
Main authors: Sievers, Aaron; Bosiek, Katharina; Bisch, Marc; Dreessen, Chris; Riedel, Jascha; Froß, Patrick; Hausmann, Michael; Hildenbrand, Georg
Format: Online Article Text
Language: English
Published: MDPI 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406869/
https://www.ncbi.nlm.nih.gov/pubmed/28422050
http://dx.doi.org/10.3390/genes8040122
_version_ 1783232051172868096
author Sievers, Aaron
Bosiek, Katharina
Bisch, Marc
Dreessen, Chris
Riedel, Jascha
Froß, Patrick
Hausmann, Michael
Hildenbrand, Georg
author_sort Sievers, Aaron
collection PubMed
description In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.
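The description mentions local k-mer spectra and basic vector correlation without giving details, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes fixed-size windows moved along the sequence (the `window` and `step` parameters are hypothetical), relative frequencies over the alphabet A, C, G, T, and the Pearson coefficient as one possible "basic vector correlation" measure.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_index(k):
    """All 4**k words of length k, in a fixed lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_spectrum(seq, k):
    """Relative frequency of every k-mer in seq, ordered as in kmer_index(k).

    Words containing symbols outside ACGT (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = kmer_index(k)
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def local_spectra(seq, k, window, step):
    """List of (start position, spectrum) pairs, one per window.

    window and step are illustrative assumptions; the record does not
    state how the positional resolution was chosen.
    """
    return [
        (start, kmer_spectrum(seq[start:start + window], k))
        for start in range(0, len(seq) - window + 1, step)
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


if __name__ == "__main__":
    # Toy sequence: a repetitive region followed by a poly-T region.
    toy = "ACGTACGTACGT" + "T" * 12
    spectra = local_spectra(toy, k=2, window=12, step=12)
    for start, spectrum in spectra:
        print(start, [round(v, 2) for v in spectrum])
    print("correlation:", pearson(spectra[0][1], spectra[1][1]))
```

Keeping the start position next to each spectrum is what allows the spectra to be drawn as a map along the genome, and correlating the spectra of two windows (or of two genomes) gives an alignment-free similarity score.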
format Online
Article
Text
id pubmed-5406869
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-5406869 2017-04-27 K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features Sievers, Aaron Bosiek, Katharina Bisch, Marc Dreessen, Chris Riedel, Jascha Froß, Patrick Hausmann, Michael Hildenbrand, Georg Genes (Basel) Article MDPI 2017-04-19 /pmc/articles/PMC5406869/ /pubmed/28422050 http://dx.doi.org/10.3390/genes8040122 Text en © 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5406869/
https://www.ncbi.nlm.nih.gov/pubmed/28422050
http://dx.doi.org/10.3390/genes8040122