Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by...
Main Authors: | Bussi, Yuval; Kapon, Ruti; Reich, Ziv |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8516232/ https://www.ncbi.nlm.nih.gov/pubmed/34648558 http://dx.doi.org/10.1371/journal.pone.0258693 |
_version_ | 1784583755008049152 |
---|---|
author | Bussi, Yuval Kapon, Ruti Reich, Ziv |
author_facet | Bussi, Yuval Kapon, Ruti Reich, Ziv |
author_sort | Bussi, Yuval |
collection | PubMed |
description | Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity. |
format | Online Article Text |
id | pubmed-8516232 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8516232 2021-10-15 Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy Bussi, Yuval Kapon, Ruti Reich, Ziv PLoS One Research Article Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity. Public Library of Science 2021-10-14 /pmc/articles/PMC8516232/ /pubmed/34648558 http://dx.doi.org/10.1371/journal.pone.0258693 Text en © 2021 Bussi et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Bussi, Yuval Kapon, Ruti Reich, Ziv Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title | Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title_full | Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title_fullStr | Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title_full_unstemmed | Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title_short | Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
title_sort | large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8516232/ https://www.ncbi.nlm.nih.gov/pubmed/34648558 http://dx.doi.org/10.1371/journal.pone.0258693 |
work_keys_str_mv | AT bussiyuval largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy AT kaponruti largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy AT reichziv largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy |
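
The description above refers to pairwise k-mer Jaccard similarities and to the sequence space coverage of k-mers. As a rough illustration only, the following Python sketch shows what those two quantities could look like for toy sequences. The function names (`kmer_set`, `jaccard`, `space_coverage`) are hypothetical, and the choices made here (forward strand only, skipping windows with ambiguous bases, coverage defined as distinct k-mers divided by 4^k) are assumptions for the example, not the authors' documented pipeline.

```python
from __future__ import annotations


def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the distinct k-mers of a DNA sequence (forward strand only).

    Assumption: windows containing characters other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def space_coverage(kmers: set[str], k: int) -> float:
    """Fraction of the 4**k possible k-mers that are present (assumed definition)."""
    return len(kmers) / 4 ** k


if __name__ == "__main__":
    # Toy sequences; real genomes would be read from FASTA files.
    a = kmer_set("ACGTAC", 3)
    b = kmer_set("ACGTTT", 3)
    print(jaccard(a, b))
    print(space_coverage(a, 3))
```

For the k-mer lengths named in the description (11 to 41), a set-of-strings representation like this would be memory-hungry on whole genomes; it is meant only to make the definitions concrete.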