A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets

Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the...

Bibliographic Details
Main Authors:	Sharma, Ashok; Podolsky, Robert; Zhao, Jieping; McIndoe, Richard A.
Format:	Text
Language:	English
Published:	Oxford University Press 2009
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672630/ https://www.ncbi.nlm.nih.gov/pubmed/19261720 http://dx.doi.org/10.1093/bioinformatics/btp123

_version_	1782166549007695872
author	Sharma, Ashok Podolsky, Robert Zhao, Jieping McIndoe, Richard A.
author_facet	Sharma, Ashok Podolsky, Robert Zhao, Jieping McIndoe, Richard A.
author_sort	Sharma, Ashok
collection	PubMed
description	Motivation: As the number of publicly available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. Algorithms are needed that are fast and can cluster extremely large datasets without affecting cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and the large distance matrices calculated, most agglomerative clustering algorithms fail on large datasets (30 000+ genes/200+ arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, and the second stage is a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results of the two-stage hyperplane algorithm with those of the conventional k-means algorithm from other available programs. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements, allowing us to cluster 44 460 genes without failure, and significantly decreases the time to completion when compared with popular k-means programs. The software was written in C# (.NET 1.1). Availability: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx. Contact: rmcindoe@mail.mcg.edu Supplementary information: Supplementary data are available at Bioinformatics online.
format	Text
id	pubmed-2672630
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2672630 2009-04-29 A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets Sharma, Ashok Podolsky, Robert Zhao, Jieping McIndoe, Richard A. Bioinformatics Original Papers Oxford University Press 2009-05-01 2009-03-04 /pmc/articles/PMC2672630/ /pubmed/19261720 http://dx.doi.org/10.1093/bioinformatics/btp123 Text en © 2009 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Sharma, Ashok Podolsky, Robert Zhao, Jieping McIndoe, Richard A. A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets
title	A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets
title_sort	modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672630/ https://www.ncbi.nlm.nih.gov/pubmed/19261720 http://dx.doi.org/10.1093/bioinformatics/btp123
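
The two-stage idea summarized in the abstract (a single-scan data reduction followed by conventional k-means) can be illustrated with a minimal Python sketch. This is not the HPCluster implementation, which the record says was written in C#; it is a simplified stand-in that assumes a flat list of BIRCH-style clustering features (count and linear sum) in place of a CF-tree, omits the hyperplane partitioning entirely, and uses hypothetical names (`single_scan_reduce`, `weighted_kmeans`, `threshold`) chosen only for illustration.

```python
import math
import random


def single_scan_reduce(rows, threshold):
    """Stage 1 (simplified): one pass over the rows.

    Each row joins the nearest micro-cluster whose centroid lies within
    `threshold`; otherwise it starts a new micro-cluster. Only a count and a
    linear sum are kept per micro-cluster (no CF-tree, unlike real BIRCH).
    """
    counts, sums = [], []
    for row in rows:
        best, best_dist = None, threshold
        for i, (n, s) in enumerate(zip(counts, sums)):
            dist = math.dist(row, [v / n for v in s])
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is None:
            counts.append(1)
            sums.append(list(row))
        else:
            counts[best] += 1
            sums[best] = [a + b for a, b in zip(sums[best], row)]
    centroids = [[v / n for v in s] for n, s in zip(counts, sums)]
    return centroids, counts


def weighted_kmeans(points, weights, k, iterations=50, seed=0):
    """Stage 2: Lloyd's k-means on the micro-cluster centroids.

    Each centroid is weighted by the number of rows it absorbed, so the
    cluster centers approximate those of k-means on the full data.
    Requires k <= len(points).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every micro-cluster to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels)
                       if lab == c]
            if not members:
                new_centers.append(centers[c])  # leave an empty cluster in place
                continue
            total = sum(w for _, w in members)
            new_centers.append(
                [sum(p[d] * w for p, w in members) / total for d in range(dim)])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data: two tight groups of 100 "genes" measured on 2 "arrays".
group_a = [[0.01 * (i % 10), 0.01 * (i // 10)] for i in range(100)]
group_b = [[5 + x, 5 + y] for x, y in group_a]
data = group_a + group_b

micro, sizes = single_scan_reduce(data, threshold=0.5)
centers, labels = weighted_kmeans(micro, sizes, k=2)
print(len(data), "rows reduced to", len(micro), "micro-clusters")
```

The second stage only ever sees the micro-cluster centroids, so its memory use and run time depend on the number of micro-clusters rather than the number of rows, which is the kind of saving the abstract attributes to the first-stage data reduction. The linear search over micro-clusters in this sketch is a simplification; a tree-based index would be needed for the search itself to scale.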