Repetitive Elements May Comprise Over Two-Thirds of the Human Genome

Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de n...

Bibliographic Details
Main Authors: de Koning, A. P. Jason, Gu, Wanjun, Castoe, Todd A., Batzer, Mark A., Pollock, David D.
Format: Online Article Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228813/
https://www.ncbi.nlm.nih.gov/pubmed/22144907
http://dx.doi.org/10.1371/journal.pgen.1002384
_version_ 1782217877523267584
author de Koning, A. P. Jason
Gu, Wanjun
Castoe, Todd A.
Batzer, Mark A.
Pollock, David D.
author_facet de Koning, A. P. Jason
Gu, Wanjun
Castoe, Todd A.
Batzer, Mark A.
Pollock, David D.
author_sort de Koning, A. P. Jason
collection PubMed
description Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo “clouds”). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%–69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Although short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed “element-specific” P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using it we identified ∼100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
format Online
Article
Text
id pubmed-3228813
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3228813 2011-12-05 Repetitive Elements May Comprise Over Two-Thirds of the Human Genome de Koning, A. P. Jason Gu, Wanjun Castoe, Todd A. Batzer, Mark A. Pollock, David D. PLoS Genet Research Article
Public Library of Science 2011-12-01 /pmc/articles/PMC3228813/ /pubmed/22144907 http://dx.doi.org/10.1371/journal.pgen.1002384 Text en de Koning et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
de Koning, A. P. Jason
Gu, Wanjun
Castoe, Todd A.
Batzer, Mark A.
Pollock, David D.
Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title_full Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title_fullStr Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title_full_unstemmed Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title_short Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
title_sort repetitive elements may comprise over two-thirds of the human genome
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228813/
https://www.ncbi.nlm.nih.gov/pubmed/22144907
http://dx.doi.org/10.1371/journal.pgen.1002384
work_keys_str_mv AT dekoningapjason repetitiveelementsmaycompriseovertwothirdsofthehumangenome
AT guwanjun repetitiveelementsmaycompriseovertwothirdsofthehumangenome
AT castoetodda repetitiveelementsmaycompriseovertwothirdsofthehumangenome
AT batzermarka repetitiveelementsmaycompriseovertwothirdsofthehumangenome
AT pollockdavidd repetitiveelementsmaycompriseovertwothirdsofthehumangenome
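
Illustrative note: the abstract describes P-clouds as a search for clusters of high-abundance oligonucleotides that are related in sequence space. The sketch below is a toy illustration of that general idea only, not the published P-clouds implementation. The function names, the k-mer length, the abundance threshold (min_count), and the use of a Hamming distance of 1 to link oligos are all assumptions chosen for readability.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every overlapping k-mer (oligonucleotide of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_clouds(seq, k=4, min_count=2, max_dist=1):
    """Group high-abundance k-mers into clusters ("clouds").

    Assumption for this sketch: two k-mers belong to the same cloud if they
    are connected by a chain of k-mers that each differ by at most max_dist
    mismatches. The thresholds are hypothetical, not values from the paper.
    """
    counts = count_kmers(seq, k)
    frequent = sorted(kmer for kmer, n in counts.items() if n >= min_count)

    # Union-find over the frequent k-mers.
    parent = {kmer: kmer for kmer in frequent}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Quadratic pairwise comparison: fine for a toy input, not for a genome.
    for a, b in combinations(frequent, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clouds = {}
    for kmer in frequent:
        clouds.setdefault(find(kmer), set()).add(kmer)
    return sorted(clouds.values(), key=lambda cloud: sorted(cloud))


if __name__ == "__main__":
    toy = "ACGTACGTACGAACGA"  # made-up sequence, not real genomic data
    for cloud in build_clouds(toy):
        print(sorted(cloud))
```

A genome-scale tool would need far more efficient data structures than the pairwise comparison used here. The sketch only shows how abundance filtering and sequence-space clustering fit together.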