KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis

Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method for comparing genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explo...

Bibliographic Details
Main authors: Pornputtapong, Natapol; Acheampong, Daniel A.; Patumcharoenpol, Preecha; Jenjaroenpun, Piroon; Wongsurawat, Thidathip; Jun, Se-Ran; Yongkiettrakul, Suganya; Chokesajjawatee, Nipa; Nookaew, Intawat
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Bioengineering and Biotechnology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7538862/
https://www.ncbi.nlm.nih.gov/pubmed/33072720
http://dx.doi.org/10.3389/fbioe.2020.556413
author Pornputtapong, Natapol
Acheampong, Daniel A.
Patumcharoenpol, Preecha
Jenjaroenpun, Piroon
Wongsurawat, Thidathip
Jun, Se-Ran
Yongkiettrakul, Suganya
Chokesajjawatee, Nipa
Nookaew, Intawat
collection PubMed
description Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, a simple, fast, and efficient method for comparing genome sequences, relies on the distribution of short DNA sequences of a particular length, referred to as k-mers. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, the k-mer length is often chosen rather arbitrarily, even though it is a critical parameter for achieving the resolution of genome/sequence distances needed to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer length from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated the use of KITSUNE for accurate species identification of two de novo assembled bacterial genomes derived from error-prone long-read sequences, and of a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mers that accurately identify viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
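To make the idea of comparing genomes by their k-mer content concrete, the following is a minimal Python sketch. It is not KITSUNE's implementation and does not reproduce the CRE, ACF, or OCF formulas, which this record does not define. The function names `kmer_counts` and `average_shared_kmers` and the toy sequences are hypothetical, chosen only for illustration. The sketch counts overlapping k-mers and reports, for several values of k, how many distinct k-mers a pair of sequences shares on average.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def average_shared_kmers(genomes: dict[str, str], k: int) -> float:
    """Average number of distinct k-mers shared by each pair of genomes.

    This is an illustrative stand-in for a "common features" measure,
    not the definition used by KITSUNE.
    """
    kmer_sets = {name: set(kmer_counts(seq, k)) for name, seq in genomes.items()}
    pairs = list(combinations(kmer_sets, 2))
    if not pairs:
        return 0.0
    return sum(len(kmer_sets[a] & kmer_sets[b]) for a, b in pairs) / len(pairs)


# Toy sequences (made up for illustration; real inputs would be whole genomes).
toy_genomes = {
    "genome_a": "ACGTACGTGACC",
    "genome_b": "ACGTTCGTGACA",
    "genome_c": "TTGTACGAGACC",
}

if __name__ == "__main__":
    for k in range(2, 7):
        print(k, average_shared_kmers(toy_genomes, k))
```

For divergent sequences the shared k-mer count tends to fall as k grows, which is the kind of trade-off between sensitivity and specificity that motivates choosing k empirically rather than arbitrarily.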
format Online
Article
Text
id pubmed-7538862
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7538862 2020-10-15 Front Bioeng Biotechnol Bioengineering and Biotechnology Frontiers Media S.A. 2020-09-23 /pmc/articles/PMC7538862/ /pubmed/33072720 http://dx.doi.org/10.3389/fbioe.2020.556413 Text en Copyright © 2020 Pornputtapong, Acheampong, Patumcharoenpol, Jenjaroenpun, Wongsurawat, Jun, Yongkiettrakul, Chokesajjawatee and Nookaew. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis
topic Bioengineering and Biotechnology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7538862/
https://www.ncbi.nlm.nih.gov/pubmed/33072720
http://dx.doi.org/10.3389/fbioe.2020.556413