K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationall...

Bibliographic Details
Main authors: Contreras-Moreira, Bruno; Filippi, Carla V; Naamati, Guy; Girón, Carlos García; Allen, James E; Flicek, Paul
Format: Online Article Text
Language: English
Published: 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7614178/
https://www.ncbi.nlm.nih.gov/pubmed/34562304
http://dx.doi.org/10.1002/tpg2.20143
author Contreras-Moreira, Bruno
Filippi, Carla V
Naamati, Guy
Girón, Carlos García
Allen, James E
Flicek, Paul
author_sort Contreras-Moreira, Bruno
collection PubMed
description The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
format Online
Article
Text
id pubmed-7614178
institution National Center for Biotechnology Information
language English
publishDate 2021
record_format MEDLINE/PubMed
spelling pubmed-7614178 2023-02-14 Plant Genome Article 2021-11-01 2021-09-25 /pmc/articles/PMC7614178/ /pubmed/34562304 http://dx.doi.org/10.1002/tpg2.20143 Text en This work is licensed under a CC BY 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes
topic Article