Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns

Bibliographic Details
Main authors: Comin, Matteo, Schimd, Michele
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168702/
https://www.ncbi.nlm.nih.gov/pubmed/25252700
http://dx.doi.org/10.1186/1471-2105-15-S9-S1
Description
Summary: BACKGROUND: With the advent of Next-Generation Sequencing (NGS) technologies, a large amount of short-read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a reference genome, alignment-based methods are not applicable. However, it is still possible to study the evolutionary relationship of unassembled genomes based on NGS data. RESULTS: We present a parameter-free, alignment-free method, called [Formula: see text], based on variable-length patterns, for the direct comparison of sets of NGS reads. We define a similarity measure using variable-length patterns, as well as reverses and reverse-complements, along with their statistical and syntactical properties. We evaluate several alignment-free statistics on the comparison of NGS reads coming from simulated and real genomes. In almost all simulations, our method [Formula: see text] outperforms all other statistics. The performance gain becomes more evident when real genomes are used. CONCLUSION: The new alignment-free statistic is highly successful in discriminating related genomes based on NGS read data. In almost all experiments, it outperforms traditional alignment-free statistics that are based on fixed-length patterns.
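
For orientation, the "traditional alignment-free statistics based on fixed-length patterns" that the summary refers to can be illustrated with the classical D2 statistic, which sums the products of k-mer counts shared by two read sets. The sketch below is not the authors' variable-length method. It is a minimal baseline under stated assumptions: reads are plain strings over A, C, G, T, each read is counted on both strands, and the names `reverse_complement`, `kmer_counts`, and `d2` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(reads, k):
    """Count every length-k word in each read and in its reverse complement."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


def d2(reads_a, reads_b, k):
    """Classical D2: sum over all k-mers of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(reads_a, k)
    counts_b = kmer_counts(reads_b, k)
    # Counter returns 0 for k-mers absent from the second read set.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

Calling `d2(reads_a, reads_b, k)` on two lists of reads gives a raw similarity score; larger values mean more shared k-mers. The word length `k` must be chosen by the user, which is the kind of parameter the method described in the summary is designed to avoid.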