
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy

Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by...


Bibliographic Details
Main authors: Bussi, Yuval, Kapon, Ruti, Reich, Ziv
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8516232/
https://www.ncbi.nlm.nih.gov/pubmed/34648558
http://dx.doi.org/10.1371/journal.pone.0258693
_version_ 1784583755008049152
author Bussi, Yuval
Kapon, Ruti
Reich, Ziv
author_facet Bussi, Yuval
Kapon, Ruti
Reich, Ziv
author_sort Bussi, Yuval
collection PubMed
description Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
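The two quantities named in the description, pairwise k-mer Jaccard similarity and k-mer sequence space coverage, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `space_coverage`) are hypothetical, only the forward strand is counted (no canonical or reverse-complement k-mers), windows containing non-ACGT characters are skipped, and the toy sequences are invented for illustration.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only).

    Assumption: windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def space_coverage(kmers, k):
    """Fraction of the 4**k possible k-mers that are present in the set."""
    return len(kmers) / 4 ** k


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    genome_a = "ACGTACGTGA"
    genome_b = "ACGTTCGTGA"
    k = 3
    set_a = kmer_set(genome_a, k)
    set_b = kmer_set(genome_b, k)
    print("Jaccard similarity:", jaccard(set_a, set_b))
    print("Coverage of 3-mer space by genome_a:", space_coverage(set_a, k))
```

Holding whole k-mer sets in memory works for toy inputs like these. At the k values the record mentions (11 to 41) and at genome scale, the sets grow large, so a real analysis would likely need a more compact representation. That is a design consideration, not something this record specifies.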
format Online
Article
Text
id pubmed-8516232
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8516232 2021-10-15 Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy Bussi, Yuval Kapon, Ruti Reich, Ziv PLoS One Research Article Public Library of Science 2021-10-14 /pmc/articles/PMC8516232/ /pubmed/34648558 http://dx.doi.org/10.1371/journal.pone.0258693 Text en © 2021 Bussi et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8516232/
https://www.ncbi.nlm.nih.gov/pubmed/34648558
http://dx.doi.org/10.1371/journal.pone.0258693