
Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome


Bibliographic Details
Main Authors: Li, Wentian, Freudenberg, Jan, Miramontes, Pedro
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3927684/
https://www.ncbi.nlm.nih.gov/pubmed/24386976
http://dx.doi.org/10.1186/1471-2105-15-2
author Li, Wentian
Freudenberg, Jan
Miramontes, Pedro
collection PubMed
description BACKGROUND: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. RESULTS: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. CONCLUSION: Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
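The quantity discussed in the abstract, the proportion of non-singleton k-mers as a function of k, can be illustrated on a toy sequence. The following is only a minimal sketch and not the authors' pipeline: the function names `kmer_counts` and `non_singleton_fraction` are hypothetical, the fraction is assumed to be taken over k-mer positions rather than over distinct k-mers, and reverse complements and ambiguous bases are ignored.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def non_singleton_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: the proportion is measured per position, not per
    distinct k-mer; the source record does not specify which.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        # The sequence is shorter than k, so there are no k-mers at all.
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


if __name__ == "__main__":
    toy = "ACGTACGTAA"  # made-up sequence, for illustration only
    for k in (2, 4, 6):
        print(k, non_singleton_fraction(toy, k))
```

On a real genome a plain in-memory `Counter` would not scale to the k values studied (20 bp to 1000 bp); a dedicated k-mer counter or a suffix-array approach would be needed, which this sketch does not attempt.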
format Online
Article
Text
id pubmed-3927684
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3927684 2014-03-05 Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome Li, Wentian Freudenberg, Jan Miramontes, Pedro BMC Bioinformatics Research Article BioMed Central 2014-01-03 /pmc/articles/PMC3927684/ /pubmed/24386976 http://dx.doi.org/10.1186/1471-2105-15-2 Text en Copyright © 2014 Li et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3927684/
https://www.ncbi.nlm.nih.gov/pubmed/24386976
http://dx.doi.org/10.1186/1471-2105-15-2