Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome
BACKGROUND: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of th...
Main authors: | Li, Wentian; Freudenberg, Jan; Miramontes, Pedro |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3927684/ https://www.ncbi.nlm.nih.gov/pubmed/24386976 http://dx.doi.org/10.1186/1471-2105-15-2 |
_version_ | 1782304165397004288 |
---|---|
author | Li, Wentian Freudenberg, Jan Miramontes, Pedro |
collection | PubMed |
description | BACKGROUND: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. RESULTS: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. CONCLUSION: Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome. |
format | Online Article Text |
id | pubmed-3927684 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3927684 2014-03-05 Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome Li, Wentian Freudenberg, Jan Miramontes, Pedro BMC Bioinformatics Research Article BioMed Central 2014-01-03 /pmc/articles/PMC3927684/ /pubmed/24386976 http://dx.doi.org/10.1186/1471-2105-15-2 Text en Copyright © 2014 Li et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3927684/ https://www.ncbi.nlm.nih.gov/pubmed/24386976 http://dx.doi.org/10.1186/1471-2105-15-2 |
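The abstract in this record measures mappability through the proportion of non-singleton k-mers at each k. The Python sketch below illustrates that quantity on a toy string. It is not the authors' pipeline: the function name `non_singleton_fraction` is hypothetical, the fraction is assumed to be taken over k-mer positions rather than over distinct k-mers, and reverse-complement matches are ignored.

```python
from collections import Counter


def non_singleton_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs more than once.

    Assumption: the proportion is computed per position on a single strand;
    the record does not state the exact counting convention.
    """
    n_positions = len(sequence) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be between 1 and the sequence length")
    # Count every overlapping k-mer in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    # A position is non-unique if its k-mer appears at least twice.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / n_positions


if __name__ == "__main__":
    toy = "ACGTACGTAA"  # toy sequence, not genomic data
    for k in (2, 4, 6):
        print(k, non_singleton_fraction(toy, k))
```

On a real reference genome the same idea would need a disk- or hash-based k-mer counter, since holding every 1000-mer of a 3 Gb sequence in a Python `Counter` is not practical.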