
How genome complexity can explain the difficulty of aligning reads to genomes

BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence...

Bibliographic Details
Main authors: Phan, Vinhthuy, Gao, Shanshan, Tran, Quang, Vo, Nam S
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4674900/
https://www.ncbi.nlm.nih.gov/pubmed/26678826
http://dx.doi.org/10.1186/1471-2105-16-S17-S3
author Phan, Vinhthuy
Gao, Shanshan
Tran, Quang
Vo, Nam S
collection PubMed
description BACKGROUND: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. RESULTS: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. CONCLUSIONS: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.
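The length-sensitive measure named in the abstract, the rate of distinct substrings of length k, can be illustrated with a minimal sketch. The function name `distinct_kmer_rate` and the normalization by the number of length-k windows (n - k + 1) are assumptions made for illustration; they are not taken from this record. The abstract states that the authors compute the measure in linear time, and that method is not reproduced here: this sketch simply stores every window in a set, which costs roughly O(n·k) time and memory.

```python
def distinct_kmer_rate(genome: str, k: int) -> float:
    """Return the fraction of length-k windows of `genome` that are distinct.

    Hypothetical helper for illustration only. Values near 1.0 mean almost
    every k-length substring is unique; low values indicate heavy repetition
    at that length.
    """
    n = len(genome)
    if k <= 0 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(genome)")
    windows = n - k + 1
    # Collect each length-k substring once; the set discards repeats.
    distinct = {genome[i:i + k] for i in range(windows)}
    return len(distinct) / windows


if __name__ == "__main__":
    repetitive = "ACGT" * 25  # 100 bases with period 4
    print(distinct_kmer_rate(repetitive, 12))
```

Choosing k close to the read length follows the abstract's description; on a strongly periodic sequence such as the one above, the rate is low because only a handful of distinct windows exist.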
format Online
Article
Text
id pubmed-4674900
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4674900 2015-12-15 How genome complexity can explain the difficulty of aligning reads to genomes Phan, Vinhthuy Gao, Shanshan Tran, Quang Vo, Nam S BMC Bioinformatics Research BioMed Central 2015-12-07 /pmc/articles/PMC4674900/ /pubmed/26678826 http://dx.doi.org/10.1186/1471-2105-16-S17-S3 Text en Copyright © 2015 Phan et al.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title How genome complexity can explain the difficulty of aligning reads to genomes
topic Research