Mappability and read length

Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference s...


Bibliographic Details
Main Authors: Li, Wentian; Freudenberg, Jan
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2014
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4226227/
https://www.ncbi.nlm.nih.gov/pubmed/25426137
http://dx.doi.org/10.3389/fgene.2014.00381
author Li, Wentian
Freudenberg, Jan
collection PubMed
description Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of 10^3 bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies diminishing returns in converting unmappable regions and reads into mappable ones as read length increases, although increasing the read length always moves mappability toward 100%.
format Online
Article
Text
id pubmed-4226227
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4226227 2014-11-25 Mappability and read length Li, Wentian Freudenberg, Jan Front Genet Genetics Frontiers Media S.A. 2014-11-10 /pmc/articles/PMC4226227/ /pubmed/25426137 http://dx.doi.org/10.3389/fgene.2014.00381 Text en Copyright © 2014 Li and Freudenberg. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Mappability and read length
topic Genetics
work_keys_str_mv AT liwentian mappabilityandreadlength
AT freudenbergjan mappabilityandreadlength
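
The diminishing-return argument in the description can be illustrated with a toy calculation. The sketch below is not the authors' method: it assumes, purely for illustration, that the unmappable fraction follows a pure power law in read length, u(L) = u0 * (L / L0)^(-a). The reference point (about 1% unmappable at 1000 bases) is taken loosely from the description; the exponent a = 1.0 and the function names are hypothetical.

```python
def unmappable_fraction(read_length, ref_length=1000.0, ref_fraction=0.01, exponent=1.0):
    """Toy power-law model: u(L) = ref_fraction * (L / ref_length) ** -exponent.

    ref_fraction ~ 1% at 1000 bases is read off the description;
    exponent is a hypothetical value chosen only for illustration.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return ref_fraction * (read_length / ref_length) ** (-exponent)


def gains_per_doubling(start_length=1000.0, doublings=4, **model):
    """Absolute drop in unmappable fraction for each successive doubling of read length."""
    gains = []
    length = start_length
    for _ in range(doublings):
        before = unmappable_fraction(length, **model)
        after = unmappable_fraction(2 * length, **model)
        gains.append(before - after)
        length *= 2
    return gains


if __name__ == "__main__":
    for i, gain in enumerate(gains_per_doubling(), start=1):
        print(f"doubling {i}: unmappable fraction falls by {gain:.5f}")
```

Under this assumed model, every doubling of the read length still lowers the unmappable fraction, but each doubling recovers less than the previous one, which is the long-tail behaviour the description refers to.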