Finding and extending ancient simple sequence repeat-derived regions in the human genome
BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regio...
Main authors: | Shortt, Jonathan A.; Ruggiero, Robert P.; Cox, Corey; Wacholder, Aaron C.; Pollock, David D. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7027126/ https://www.ncbi.nlm.nih.gov/pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y |
_version_ | 1783498805986983936 |
---|---|
author | Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. |
author_facet | Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. |
author_sort | Shortt, Jonathan A. |
collection | PubMed |
description | BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and mutation rate. We applied an empirical, kmer-based, approach to identify genome regions that are likely derived from SSRs. RESULTS: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified ‘SSR-clouds’, groups of similar kmers (or ‘oligos’) that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome. CONCLUSIONS: Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A’s annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with mean about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-Clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure. |
format | Online Article Text |
id | pubmed-7027126 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7027126 2020-02-24 Finding and extending ancient simple sequence repeat-derived regions in the human genome Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. Mob DNA Research BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and mutation rate. We applied an empirical, kmer-based, approach to identify genome regions that are likely derived from SSRs. RESULTS: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified ‘SSR-clouds’, groups of similar kmers (or ‘oligos’) that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome. CONCLUSIONS: Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A’s annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with mean about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-Clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure. BioMed Central 2020-02-17 /pmc/articles/PMC7027126/ /pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y Text en © The Author(s) 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title | Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title_full | Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title_fullStr | Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title_full_unstemmed | Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title_short | Finding and extending ancient simple sequence repeat-derived regions in the human genome |
title_sort | finding and extending ancient simple sequence repeat-derived regions in the human genome |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7027126/ https://www.ncbi.nlm.nih.gov/pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y |
work_keys_str_mv | AT shorttjonathana findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT ruggierorobertp findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT coxcorey findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT wacholderaaronc findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT pollockdavidd findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome |