Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

BACKGROUND: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of...

Bibliographic Details
Main Authors:	Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina; Qiu, Judy; Fox, Geoffrey
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305784/ https://www.ncbi.nlm.nih.gov/pubmed/22536872 http://dx.doi.org/10.1186/1471-2105-13-S2-S9

_version_	1782227147520212992
author	Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina; Qiu, Judy; Fox, Geoffrey
author_sort	Hughes, Adam
collection	PubMed
description	BACKGROUND: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for identifying potential gene clusters and families, but such analysis is a daunting computational task. The aim of this work is to develop an efficient pipeline for clustering large sequence read sets. METHODS: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By using Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. RESULTS: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but at substantially lower cost. In particular, the wall clock time required to cluster a set of 100,000 sequences was reduced from seven hours to less than one hour through the use of interpolative MDS. CONCLUSIONS: Although work remains to be done in selecting the optimal training set size for interpolative MDS, the substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
format	Online Article Text
id	pubmed-3305784
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3305784 2012-03-16 Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina; Qiu, Judy; Fox, Geoffrey BMC Bioinformatics Proceedings BioMed Central 2012-03-13 /pmc/articles/PMC3305784/ /pubmed/22536872 http://dx.doi.org/10.1186/1471-2105-13-S2-S9 Text en Copyright ©2012 Hughes et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305784/ https://www.ncbi.nlm.nih.gov/pubmed/22536872 http://dx.doi.org/10.1186/1471-2105-13-S2-S9
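
Illustrative sketch (not part of the catalogued record): the abstract describes two steps, computing pairwise genetic distances with Needleman-Wunsch alignment and then placing sequences in a low-dimensional space with interpolative MDS. The Python below is a minimal sketch of both ideas under stated assumptions. The scoring values, the distance definition (1 minus the fraction of identical alignment columns), and the function names `nw_distance` and `interpolate_point` are hypothetical choices made here for illustration. The interpolation step is a generic majorization update for one out-of-sample point against fixed anchor points; whether it matches the authors' implementation is an assumption.

```python
import math

def nw_distance(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b.
    Returns 1 - (identical columns / alignment length). Scoring is an assumption."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting identical columns.
    i, j, same, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 1.0 - same / length if length else 0.0

def interpolate_point(anchors, deltas, iters=100):
    """Place one new point given fixed anchor coordinates (already embedded
    in-sample points) and target distances deltas to each anchor.
    Uses the majorization update x <- centroid + (1/k) * sum(delta_i * (x - p_i) / |x - p_i|)."""
    k, dim = len(anchors), len(anchors[0])
    centroid = [sum(p[c] for p in anchors) / k for c in range(dim)]
    x = centroid[:]
    for _ in range(iters):
        step = [0.0] * dim
        for p, delta in zip(anchors, deltas):
            d = math.sqrt(sum((x[c] - p[c]) ** 2 for c in range(dim)))
            if d == 0.0:
                continue  # direction undefined; skip this anchor for this step
            for c in range(dim):
                step[c] += delta * (x[c] - p[c]) / d
        x = [centroid[c] + step[c] / k for c in range(dim)]
    return x

if __name__ == "__main__":
    print(nw_distance("ACGTACGT", "ACGTTCGT"))
    anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
    print(interpolate_point(anchors, [math.sqrt(2), math.sqrt(10), math.sqrt(5)]))
```

In this sketch only the in-sample (training) points would need a full MDS run. Each remaining sequence needs distances to a few anchors and one call to `interpolate_point`, which is the kind of saving the abstract attributes to interpolation.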
