
Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns

BACKGROUND: With the advent of Next-Generation Sequencing technologies (NGS), a large amount of short read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cann...


Bibliographic Details
Main Authors: Comin, Matteo, Schimd, Michele
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168702/
https://www.ncbi.nlm.nih.gov/pubmed/25252700
http://dx.doi.org/10.1186/1471-2105-15-S9-S1
_version_ 1782335602033688576
author Comin, Matteo
Schimd, Michele
author_facet Comin, Matteo
Schimd, Michele
author_sort Comin, Matteo
collection PubMed
description BACKGROUND: With the advent of Next-Generation Sequencing (NGS) technologies, a large amount of short-read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a reference genome, alignment-based methods are not applicable. However, it is still possible to study the evolutionary relationship of unassembled genomes based on NGS data. RESULTS: We present a parameter-free, alignment-free method, called [Formula: see text], based on variable-length patterns, for the direct comparison of sets of NGS reads. We define a similarity measure using variable-length patterns, as well as reverses and reverse-complements, along with their statistical and syntactical properties. We evaluate several alignment-free statistics on the comparison of NGS reads coming from simulated and real genomes. In almost all simulations, our method [Formula: see text] outperforms all other statistics. The performance gain becomes more evident when real genomes are used. CONCLUSION: The new alignment-free statistic is highly successful in discriminating related genomes based on NGS read data. In almost all experiments, it outperforms traditional alignment-free statistics that are based on fixed-length patterns.
format Online
Article
Text
id pubmed-4168702
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4168702 2014-10-02 Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns Comin, Matteo Schimd, Michele BMC Bioinformatics Proceedings BioMed Central 2014-09-10 /pmc/articles/PMC4168702/ /pubmed/25252700 http://dx.doi.org/10.1186/1471-2105-15-S9-S1 Text en Copyright © 2014 Comin and Schimd; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Proceedings
Comin, Matteo
Schimd, Michele
Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns
title Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns
topic Proceedings