
MeShClust: an intelligent tool for clustering DNA sequences

Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a clust...

Bibliographic Details
Main Authors:	James, Benjamin T; Luczak, Brian B; Girgis, Hani Z
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Methods Online
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101578/ https://www.ncbi.nlm.nih.gov/pubmed/29718317 http://dx.doi.org/10.1093/nar/gky315

_version_	1783349043933478912
author	James, Benjamin T; Luczak, Brian B; Girgis, Hani Z
author_facet	James, Benjamin T; Luczak, Brian B; Girgis, Hani Z
author_sort	James, Benjamin T
collection	PubMed
description	Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Often times, a biologist may not know the exact sequence similarity. Therefore, clusters produced by these tools do not likely match the real clusters comprising the data if the provided parameter is inaccurate. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, e.g. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust’s ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
format	Online Article Text
id	pubmed-6101578
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-61015782018-08-27 MeShClust: an intelligent tool for clustering DNA sequences James, Benjamin T Luczak, Brian B Girgis, Hani Z Nucleic Acids Res Methods Online Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Often times, a biologist may not know the exact sequence similarity. Therefore, clusters produced by these tools do not likely match the real clusters comprising the data if the provided parameter is inaccurate. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, e.g. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust’s ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate. Oxford University Press 2018-08-21 2018-05-01 /pmc/articles/PMC6101578/ /pubmed/29718317 http://dx.doi.org/10.1093/nar/gky315 Text en © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Methods Online James, Benjamin T Luczak, Brian B Girgis, Hani Z MeShClust: an intelligent tool for clustering DNA sequences
title	MeShClust: an intelligent tool for clustering DNA sequences
title_full	MeShClust: an intelligent tool for clustering DNA sequences
title_fullStr	MeShClust: an intelligent tool for clustering DNA sequences
title_full_unstemmed	MeShClust: an intelligent tool for clustering DNA sequences
title_short	MeShClust: an intelligent tool for clustering DNA sequences
title_sort	meshclust: an intelligent tool for clustering dna sequences
topic	Methods Online
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101578/ https://www.ncbi.nlm.nih.gov/pubmed/29718317 http://dx.doi.org/10.1093/nar/gky315
work_keys_str_mv	AT jamesbenjamint meshclustanintelligenttoolforclusteringdnasequences AT luczakbrianb meshclustanintelligenttoolforclusteringdnasequences AT girgishaniz meshclustanintelligenttoolforclusteringdnasequences
