A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

BACKGROUND: The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on...

Bibliographic Details
Main authors: Kurtz, Stefan; Narechania, Apurva; Stein, Joshua C; Ware, Doreen
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613927/
https://www.ncbi.nlm.nih.gov/pubmed/18976482
http://dx.doi.org/10.1186/1471-2164-9-517
author Kurtz, Stefan
Narechania, Apurva
Stein, Joshua C
Ware, Doreen
collection PubMed
description BACKGROUND: The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user tasks. RESULTS: Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives much greater flexibility in the choice of the k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity. CONCLUSION: The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see .
format Text
id pubmed-2613927
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2613927 2009-01-12 A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes Kurtz, Stefan Narechania, Apurva Stein, Joshua C Ware, Doreen BMC Genomics Methodology Article BioMed Central 2008-10-31 /pmc/articles/PMC2613927/ /pubmed/18976482 http://dx.doi.org/10.1186/1471-2164-9-517 Text en Copyright © 2008 Kurtz et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
topic Methodology Article