Finding and extending ancient simple sequence repeat-derived regions in the human genome

BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regio...

Bibliographic Details
Main authors: Shortt, Jonathan A., Ruggiero, Robert P., Cox, Corey, Wacholder, Aaron C., Pollock, David D.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7027126/
https://www.ncbi.nlm.nih.gov/pubmed/32095164
http://dx.doi.org/10.1186/s13100-020-00206-y
author Shortt, Jonathan A.
Ruggiero, Robert P.
Cox, Corey
Wacholder, Aaron C.
Pollock, David D.
collection PubMed
description BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and mutation rate. We applied an empirical, kmer-based approach to identify genome regions that are likely derived from SSRs. RESULTS: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified ‘SSR-clouds’, groups of similar kmers (or ‘oligos’) that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome. CONCLUSIONS: Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A sequences annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with a mean of about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure.
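The kmer-enrichment idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' P-clouds implementation: the function names, the thresholds (`min_count`, `min_ratio`), the pseudocount, and the toy sequences are all assumptions made for illustration. It collects kmers that are over-represented in a training set relative to a background set, then marks the stretches of a query sequence covered by those kmers.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer made only of A/C/G/T across the given sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def enriched_kmers(train_seqs, background_seqs, k, min_count=2, min_ratio=2.0):
    """Return k-mers whose training frequency exceeds the background frequency.

    Hypothetical criterion: a k-mer is kept when it occurs at least
    min_count times in training and its frequency ratio is >= min_ratio.
    """
    train = count_kmers(train_seqs, k)
    background = count_kmers(background_seqs, k)
    train_total = sum(train.values()) or 1
    background_total = sum(background.values()) or 1
    cloud = set()
    for kmer, n in train.items():
        if n < min_count:
            continue
        f_train = n / train_total
        # Pseudocount of 1 avoids division by zero for unseen k-mers.
        f_background = (background[kmer] + 1) / (background_total + 1)
        if f_train / f_background >= min_ratio:
            cloud.add(kmer)
    return cloud


def annotate(seq, cloud, k):
    """Return merged half-open (start, end) intervals covered by cloud k-mers."""
    seq = seq.upper()
    intervals = []
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in cloud:
            if intervals and i <= intervals[-1][1]:
                intervals[-1][1] = i + k  # extend the current interval
            else:
                intervals.append([i, i + k])
    return [tuple(iv) for iv in intervals]


# Toy data: an (AC)n repeat, one copy interrupted by a substitution.
training = ["ACACACACACAC", "ACACACATACAC"]
background = ["GATTCGGCTAAGCTTGCA"]
cloud = enriched_kmers(training, background, k=4)
hits = annotate("GGTTACACACAGGT", cloud, k=4)
```

With these toy inputs the cloud contains only the two repeat-phase kmers, and `hits` covers the embedded repeat in the query. A real analysis would train on genome-scale data and would group related kmers, which this sketch does not attempt.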
format Online
Article
Text
id pubmed-7027126
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7027126 2020-02-24 Finding and extending ancient simple sequence repeat-derived regions in the human genome Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. Mob DNA Research BioMed Central 2020-02-17 /pmc/articles/PMC7027126/ /pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y Text en © The Author(s) 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Finding and extending ancient simple sequence repeat-derived regions in the human genome
topic Research