A highly efficient multi-core algorithm for clustering extremely large datasets

BACKGROUND: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need...

Bibliographic Details
Main Authors:	Kraus, Johann M, Kestler, Hans A
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2865495/ https://www.ncbi.nlm.nih.gov/pubmed/20370922 http://dx.doi.org/10.1186/1471-2105-11-169

_version_	1782180843604672512
author	Kraus, Johann M Kestler, Hans A
author_facet	Kraus, Johann M Kestler, Hans A
author_sort	Kraus, Johann M
collection	PubMed
description	BACKGROUND: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. RESULTS: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. CONCLUSIONS: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
format	Text
id	pubmed-2865495
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2865495 2010-05-07 A highly efficient multi-core algorithm for clustering extremely large datasets Kraus, Johann M Kestler, Hans A BMC Bioinformatics Software
BioMed Central 2010-04-06 /pmc/articles/PMC2865495/ /pubmed/20370922 http://dx.doi.org/10.1186/1471-2105-11-169 Text en Copyright ©2010 Kraus and Kestler; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Software Kraus, Johann M Kestler, Hans A A highly efficient multi-core algorithm for clustering extremely large datasets
title	A highly efficient multi-core algorithm for clustering extremely large datasets
title_sort	highly efficient multi-core algorithm for clustering extremely large datasets
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2865495/ https://www.ncbi.nlm.nih.gov/pubmed/20370922 http://dx.doi.org/10.1186/1471-2105-11-169

Similar Items