Next generation sequencing reads comparison with an alignment-free distance

BACKGROUND: Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis. METHODS: We propose a m...

Bibliographic Details
Main Authors:	Weitschek, Emanuel; Santoni, Daniele; Fiscon, Giulia; De Cola, Maria Cristina; Bertolazzi, Paola; Felici, Giovanni
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265526/ https://www.ncbi.nlm.nih.gov/pubmed/25465386 http://dx.doi.org/10.1186/1756-0500-7-869

author	Weitschek, Emanuel; Santoni, Daniele; Fiscon, Giulia; De Cola, Maria Cristina; Bertolazzi, Paola; Felici, Giovanni
author_sort	Weitschek, Emanuel
collection	PubMed
description	BACKGROUND: Next Generation Sequencing (NGS) machines extract a large number of short DNA fragments (reads) from a biological sample. These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, and mutation analysis. METHODS: We propose a method to evaluate the similarity between reads. This method does not rely on aligning the reads; it is based on the distance between the frequencies of their fixed-length substrings (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and BLAST. The comparison rests on a simple assumption: the most correct distance is the one obtained by knowing the reference sequence in advance. Therefore, we first align the reads to the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to correctly reproduce the ideal distance is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples. RESULTS: We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming distances based on alignment. CONCLUSIONS: Alignment-free distances may be used effectively for read comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, read classification). ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1756-0500-7-869) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4265526
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4265526 2014-12-15 BMC Res Notes Research Article BioMed Central 2014-12-03 /pmc/articles/PMC4265526/ /pubmed/25465386 http://dx.doi.org/10.1186/1756-0500-7-869 Text en © Weitschek et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Next generation sequencing reads comparison with an alignment-free distance
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265526/ https://www.ncbi.nlm.nih.gov/pubmed/25465386 http://dx.doi.org/10.1186/1756-0500-7-869
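
The abstract in this record describes an alignment-free distance built from the frequencies of k-mers in each read. A minimal Python sketch of that idea follows. The function names, the default k = 3, and the use of Euclidean distance are assumptions made for illustration; the abstract does not state which distance or which k the authors use.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read: str, k: int) -> dict:
    """Return the relative frequency of each k-mer (length-k substring) in a read."""
    total = len(read) - k + 1  # number of overlapping k-mers in the read
    if k <= 0 or total <= 0:
        raise ValueError("k must be positive and no longer than the read")
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(read_a: str, read_b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of two reads.

    Assumption: Euclidean distance and k=3 are illustrative choices only,
    not necessarily the settings used in the article.
    """
    freq_a = kmer_frequencies(read_a, k)
    freq_b = kmer_frequencies(read_b, k)
    # k-mers absent from one read contribute a frequency of 0 for that read.
    all_kmers = freq_a.keys() | freq_b.keys()
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2
                    for m in all_kmers))
```

Under these assumptions, identical reads have distance 0, and reads sharing no k-mers have the largest distances. No alignment step is needed: each read is scanned once to build its profile, which is where the speed-up mentioned in the abstract would come from.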
