Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction
BACKGROUND: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Th...
Main authors: | Tran, Ngoc Hieu; Chen, Xin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4057587/ https://www.ncbi.nlm.nih.gov/pubmed/24886411 http://dx.doi.org/10.1186/1756-0500-7-320 |
_version_ | 1782320991104401408 |
---|---|
author | Tran, Ngoc Hieu Chen, Xin |
author_sort | Tran, Ngoc Hieu |
collection | PubMed |
description | BACKGROUND: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads. RESULTS: Recently several k-mer based distance measures such as CVTree, [Formula: see text] , and co-phylog have been proposed or enhanced to address this problem. However, how to choose an optimal k value for those distance measures is not trivial since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, as the compression-based distance measures are parameter-free, no parameter optimization is required and these measures still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. CONCLUSIONS: The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genomic sequences as well as the comparison of NGS samples of short reads. |
format | Online Article Text |
id | pubmed-4057587 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4057587 2014-06-23 Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction Tran, Ngoc Hieu Chen, Xin BMC Res Notes Research Article
BioMed Central 2014-05-29 /pmc/articles/PMC4057587/ /pubmed/24886411 http://dx.doi.org/10.1186/1756-0500-7-320 Text en Copyright © 2014 Tran and Chen; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4057587/ https://www.ncbi.nlm.nih.gov/pubmed/24886411 http://dx.doi.org/10.1186/1756-0500-7-320 |
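
The abstract describes compression-based distances as parameter-free and applicable directly to unassembled short reads, but this record does not give their formulas. As a rough illustration only, the sketch below uses the widely known normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of `zlib` as the compressor, the helper names `compressed_size`, `ncd`, and `sample_to_bytes`, and the way reads are joined into one byte string are all assumptions made for this example, not details taken from the article.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in compressor, an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little that the compressor can exploit.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def sample_to_bytes(reads) -> bytes:
    """Join the short reads of one sample into a single byte string.

    Hypothetical preprocessing step: reads are simply concatenated with
    newlines, with no assembly or alignment.
    """
    return "\n".join(reads).encode("ascii")


if __name__ == "__main__":
    sample_a = sample_to_bytes(["ACGTACGTTAGC", "TTAGCCGATACG", "GGATCCGATTAC"])
    sample_b = sample_to_bytes(["ACGTACGTTAGC", "TTAGCCGATACG", "GGATCCGTTTAC"])
    print(f"NCD(a, b) = {ncd(sample_a, sample_b):.3f}")
```

A general-purpose compressor such as zlib only sees repeats within a 32 KB window, so a sketch like this is suitable for small inputs; real NGS samples would need a compressor designed for large genomic data. A matrix of pairwise distances computed this way could then be passed to a standard tree-building method for phylogenetic reconstruction.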