
Centroid based clustering of high throughput sequencing reads based on n-mer counts

BACKGROUND: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of...
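As a rough illustration of what clustering by sequence composition can look like, the sketch below counts n-mers (words of length n) in each read, normalises the counts to frequencies, and groups the resulting vectors with a plain k-means loop. It is a minimal sketch under stated assumptions, not the authors' afcluster implementation: the function names `nmer_frequencies`, `sq_dist` and `kmeans`, the word length n = 2, the squared Euclidean distance and the random initialisation are all choices made here for illustration. Data whitening and the soft expectation maximization variant mentioned in the record's description are not shown.

```python
import random
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA reads; other symbols are skipped


def nmer_frequencies(seq, n=2):
    """Return the normalised n-mer count vector of one read (length 4**n)."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0.0] * len(words)
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:  # ignore words containing N or other symbols
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Hard k-means (Lloyd's algorithm): returns (labels, centroids)."""
    rng = random.Random(seed)
    # Hypothetical initialisation: k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

A call such as `kmeans([nmer_frequencies(r) for r in reads], k=2)` returns one cluster label per read. Short reads give noisy frequency vectors, which is consistent with the record's remark that read length is the major factor affecting performance.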


Bibliographic Details
Main Authors:	Solovyov, Alexander; Lipkin, W Ian
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848435/ https://www.ncbi.nlm.nih.gov/pubmed/24011402 http://dx.doi.org/10.1186/1471-2105-14-268

collection	PubMed
description	BACKGROUND: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. RESULTS: We study several centroid-based algorithms for clustering sequences based on word counts. Study of their performance shows that the k-means algorithm, with or without data whitening, is computationally efficient. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from GitHub: https://github.com/luscinius/afcluster. CONCLUSIONS: We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.
format	Online Article Text
id	pubmed-3848435
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3848435 2013-12-05 BMC Bioinformatics Research Article BioMed Central 2013-09-08 Text en Copyright © 2013 Solovyov and Lipkin; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
