
Finding and extending ancient simple sequence repeat-derived regions in the human genome

BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regio...


Bibliographic Details
Main authors:	Shortt, Jonathan A., Ruggiero, Robert P., Cox, Corey, Wacholder, Aaron C., Pollock, David D.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7027126/ https://www.ncbi.nlm.nih.gov/pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y

_version_	1783498805986983936
author	Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D.
author_sort	Shortt, Jonathan A.
collection	PubMed
description	BACKGROUND: Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and mutation rate. We applied an empirical, kmer-based, approach to identify genome regions that are likely derived from SSRs. RESULTS: The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified ‘SSR-clouds’, groups of similar kmers (or ‘oligos’) that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome. CONCLUSIONS: Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of Alu (roughly, AluJ), validating the sensitivity of the approach. Poly-A’s annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with mean about 35 bp even in older Alus. This work demonstrates that the high sensitivity provided by SSR-Clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure.
format	Online Article Text
id	pubmed-7027126
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7027126 2020-02-24 Finding and extending ancient simple sequence repeat-derived regions in the human genome Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. Mob DNA Research BioMed Central 2020-02-17 /pmc/articles/PMC7027126/ /pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y Text en © The Author(s) 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Shortt, Jonathan A. Ruggiero, Robert P. Cox, Corey Wacholder, Aaron C. Pollock, David D. Finding and extending ancient simple sequence repeat-derived regions in the human genome
title	Finding and extending ancient simple sequence repeat-derived regions in the human genome
title_sort	finding and extending ancient simple sequence repeat-derived regions in the human genome
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7027126/ https://www.ncbi.nlm.nih.gov/pubmed/32095164 http://dx.doi.org/10.1186/s13100-020-00206-y
work_keys_str_mv	AT shorttjonathana findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT ruggierorobertp findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT coxcorey findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT wacholderaaronc findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome AT pollockdavidd findingandextendingancientsimplesequencerepeatderivedregionsinthehumangenome
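
The abstract in this record describes grouping similar kmers that are enriched in a training set of SSR loci and then using those groups to flag likely SSR-derived regions. The Python sketch below is only a toy illustration of that general idea, not the authors' P-clouds or SSR-clouds implementation. The function names, the kmer length, the two count thresholds, the one-substitution neighbourhood, and the example sequences are all assumptions made for this illustration.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every overlapping kmer of length k in the training sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def neighbors(kmer):
    """Yield every kmer that differs from `kmer` by exactly one substitution."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def build_cloud(counts, core_min, outer_min):
    """Return (core, cloud).

    core  : kmers seen at least core_min times (hypothetical threshold).
    cloud : core plus one-substitution neighbours seen at least outer_min times.
    """
    core = {km for km, n in counts.items() if n >= core_min}
    cloud = set(core)
    for km in core:
        for nb in neighbors(km):
            if counts.get(nb, 0) >= outer_min:
                cloud.add(nb)
    return core, cloud


def annotate(seq, kmer_set, k):
    """Return merged half-open intervals of seq covered by kmers in kmer_set."""
    intervals = []
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmer_set:
            if intervals and i <= intervals[-1][1]:
                intervals[-1][1] = i + k  # overlaps or touches the last hit
            else:
                intervals.append([i, i + k])
    return [tuple(iv) for iv in intervals]


# Made-up training data: one perfect AC repeat and one with a single interruption.
training = ["ACACACACACACACAC", "ACACACATACACACAC"]
K = 4
counts = count_kmers(training, K)
core, cloud = build_cloud(counts, core_min=5, outer_min=1)

# Made-up query: an interrupted AC repeat embedded in unrelated sequence.
query = "GGGGACACATACACGGGG"
print(annotate(query, core, K))   # exact-repeat kmers only
print(annotate(query, cloud, K))  # with the degraded neighbour kmers added
```

In this toy case the core kmers alone split the interrupted repeat into two intervals, while the expanded set reports it as one contiguous region. That is the kind of sensitivity gain the abstract attributes to a cloud-based approach.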

