
Measuring the Invisible: The Sequences Causal of Genome Size Differences in Eyebrights (Euphrasia) Revealed by k-mers

Genome size variation within plant taxa is due to presence/absence variation, which may affect low-copy sequences or genomic repeats of various frequency classes. However, identifying the sequences underpinning genome size variation is challenging because genome assemblies commonly contain collapsed...

Bibliographic Details
Main Authors: Becher, Hannes; Sampson, Jacob; Twyford, Alex D.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Plant Science
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9372453/
https://www.ncbi.nlm.nih.gov/pubmed/35968114
http://dx.doi.org/10.3389/fpls.2022.818410
author Becher, Hannes
Sampson, Jacob
Twyford, Alex D.
collection PubMed
description Genome size variation within plant taxa is due to presence/absence variation, which may affect low-copy sequences or genomic repeats of various frequency classes. However, identifying the sequences underpinning genome size variation is challenging because genome assemblies commonly contain collapsed representations of repetitive sequences and because genome skimming studies by design miss low-copy number sequences. Here, we take a novel approach based on k-mers, short sub-sequences of equal length k, generated from whole-genome sequencing data of diploid eyebrights (Euphrasia), a group of plants that have considerable genome size variation within a ploidy level. We compare k-mer inventories within and between closely related species, and quantify the contribution of different copy number classes to genome size differences. We further match high-copy number k-mers to specific repeat types as retrieved from the RepeatExplorer2 pipeline. We find genome size differences of up to 230 Mbp, equivalent to more than 20% genome size variation. The largest contributions to these differences come from rDNA sequences, a 145-nt genomic satellite and a repeat associated with an Angela transposable element. We also find size differences in the low-copy number class (copy number ≤ 10×) of up to 27 Mbp, possibly indicating differences in gene space between our samples. We demonstrate that it is possible to pinpoint the sequences causing genome size variation within species without the use of a reference genome. Such sequences can serve as targets for future cytogenetic studies. We also show that studies of genome size variation should go beyond repeats if they aim to characterise the full range of genomic variants. To allow future work with other taxonomic groups, we share our k-mer analysis pipeline, which is straightforward to run, relying largely on standard GNU command line tools.
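The idea in the description (count k-mers per sample, convert counts to copy numbers, and attribute the genome size difference to copy-number classes) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which the record says relies largely on GNU command line tools. The function names, the per-copy depth parameters `depth_a` and `depth_b`, and the rule of assigning a k-mer to a class by its larger copy number across the two samples are all assumptions made for illustration. Reverse-complement handling is omitted for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def size_difference_by_class(counts_a, counts_b, depth_a, depth_b, low_copy_max=10):
    """Split the size difference (sample B minus sample A) by copy-number class.

    depth_a and depth_b are hypothetical per-copy sequencing depths, so
    count / depth approximates how often a k-mer occurs in the genome.
    Assumption: a k-mer is 'low_copy' if its copy number is at most
    low_copy_max in both samples, otherwise 'high_copy'.
    """
    diff = {"low_copy": 0.0, "high_copy": 0.0}
    for kmer in counts_a.keys() | counts_b.keys():
        # Counter returns 0 for k-mers absent from a sample.
        copies_a = counts_a[kmer] / depth_a
        copies_b = counts_b[kmer] / depth_b
        cls = "low_copy" if max(copies_a, copies_b) <= low_copy_max else "high_copy"
        diff[cls] += copies_b - copies_a
    return diff


# Toy inventories: one repeat k-mer that is expanded in B, one shared
# single-copy k-mer, and one single-copy k-mer present only in B.
sample_a = Counter({"AAAA": 20, "ACGT": 2})
sample_b = Counter({"AAAA": 60, "ACGT": 2, "TTTT": 2})
result = size_difference_by_class(sample_a, sample_b, depth_a=2, depth_b=2)
```

Each unit of difference corresponds to one k-mer position, so with real data the totals approximate base pairs gained or lost per class.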
format Online
Article
Text
id pubmed-9372453
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9372453 2022-08-13 Measuring the Invisible: The Sequences Causal of Genome Size Differences in Eyebrights (Euphrasia) Revealed by k-mers. Becher, Hannes; Sampson, Jacob; Twyford, Alex D. Front Plant Sci. Plant Science. Frontiers Media S.A. 2022-07-29. /pmc/articles/PMC9372453/ /pubmed/35968114 http://dx.doi.org/10.3389/fpls.2022.818410 Text en. Copyright © 2022 Becher, Sampson and Twyford. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Measuring the Invisible: The Sequences Causal of Genome Size Differences in Eyebrights (Euphrasia) Revealed by k-mers
topic Plant Science