How genome complexity can explain the difficulty of aligning reads to genomes
BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence...
Main authors: | Phan, Vinhthuy; Gao, Shanshan; Tran, Quang; Vo, Nam S |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674900/ https://www.ncbi.nlm.nih.gov/pubmed/26678826 http://dx.doi.org/10.1186/1471-2105-16-S17-S3 |
_version_ | 1782404970521296896 |
---|---|
author | Phan, Vinhthuy Gao, Shanshan Tran, Quang Vo, Nam S |
author_facet | Phan, Vinhthuy Gao, Shanshan Tran, Quang Vo, Nam S |
author_sort | Phan, Vinhthuy |
collection | PubMed |
description | BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. RESULTS: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. CONCLUSIONS: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes. |
format | Online Article Text |
id | pubmed-4674900 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4674900 2015-12-15 How genome complexity can explain the difficulty of aligning reads to genomes Phan, Vinhthuy Gao, Shanshan Tran, Quang Vo, Nam S BMC Bioinformatics Research BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. RESULTS: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. CONCLUSIONS: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes. BioMed Central 2015-12-07 /pmc/articles/PMC4674900/ /pubmed/26678826 http://dx.doi.org/10.1186/1471-2105-16-S17-S3 Text en Copyright © 2015 Phan et al.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Phan, Vinhthuy Gao, Shanshan Tran, Quang Vo, Nam S How genome complexity can explain the difficulty of aligning reads to genomes |
title | How genome complexity can explain the difficulty of aligning reads to genomes |
title_full | How genome complexity can explain the difficulty of aligning reads to genomes |
title_fullStr | How genome complexity can explain the difficulty of aligning reads to genomes |
title_full_unstemmed | How genome complexity can explain the difficulty of aligning reads to genomes |
title_short | How genome complexity can explain the difficulty of aligning reads to genomes |
title_sort | how genome complexity can explain the difficulty of aligning reads to genomes |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674900/ https://www.ncbi.nlm.nih.gov/pubmed/26678826 http://dx.doi.org/10.1186/1471-2105-16-S17-S3 |
work_keys_str_mv | AT phanvinhthuy howgenomecomplexitycanexplainthedifficultyofaligningreadstogenomes AT gaoshanshan howgenomecomplexitycanexplainthedifficultyofaligningreadstogenomes AT tranquang howgenomecomplexitycanexplainthedifficultyofaligningreadstogenomes AT vonams howgenomecomplexitycanexplainthedifficultyofaligningreadstogenomes |
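
The description field in this record refers to "the rate of distinct substrings of length k" as a length-sensitive complexity measure. The following is a minimal illustrative sketch of one way such a rate could be computed. It is not the authors' method: the record states that the measure can be computed in linear time, whereas this sketch uses a plain hash set of substrings and takes roughly O(n·k) time. The function name `distinct_kmer_rate` and the normalization (distinct substrings divided by the total number of length-k windows) are assumptions made for illustration only.

```python
def distinct_kmer_rate(sequence: str, k: int) -> float:
    """Fraction of length-k windows in `sequence` that are distinct.

    Assumption: the rate is normalized by the number of windows,
    n - k + 1. The article itself may normalize differently.
    """
    n = len(sequence)
    if k <= 0 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")

    total_windows = n - k + 1
    # Collect every substring of length k; the set keeps one copy of each.
    distinct = {sequence[i:i + k] for i in range(total_windows)}
    return len(distinct) / total_windows


if __name__ == "__main__":
    # A highly repetitive toy sequence yields a low rate.
    print(distinct_kmer_rate("AAAAAA", 3))
    # A sequence with no repeated 2-mers yields a rate of 1.0.
    print(distinct_kmer_rate("ACGT", 2))
```

Under this assumed definition, a value near 1 means almost every length-k window is unique, and lower values indicate more repeated content at that length. For genome-scale inputs, a suffix-array or similar index would be a more realistic choice than a Python set of string slices.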