GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thous...

Bibliographic Details
Main authors: Kratochvíl, Miroslav, Hunewald, Oliver, Heirendt, Laurent, Verissimo, Vasco, Vondrášek, Jiří, Satagopam, Venkata P, Schneider, Reinhard, Trefois, Christophe, Ollert, Markus
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672468/
https://www.ncbi.nlm.nih.gov/pubmed/33205814
http://dx.doi.org/10.1093/gigascience/giaa127
_version_ 1783611143122583552
author Kratochvíl, Miroslav
Hunewald, Oliver
Heirendt, Laurent
Verissimo, Vasco
Vondrášek, Jiří
Satagopam, Venkata P
Schneider, Reinhard
Trefois, Christophe
Ollert, Markus
author_facet Kratochvíl, Miroslav
Hunewald, Oliver
Heirendt, Laurent
Verissimo, Vasco
Vondrášek, Jiří
Satagopam, Venkata P
Schneider, Reinhard
Trefois, Christophe
Ollert, Markus
author_sort Kratochvíl, Miroslav
collection PubMed
description BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. The analysis of that amount of high-dimensional data becomes demanding in both hardware and software of high-performance computational resources. Current software tools often do not scale to the datasets of such size; users are thus forced to downsample the data to bearable sizes, in turn losing accuracy and ability to detect many underlying complex phenomena. RESULTS: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study. CONCLUSIONS: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.
format Online
Article
Text
id pubmed-7672468
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-76724682020-11-24 GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets Kratochvíl, Miroslav Hunewald, Oliver Heirendt, Laurent Verissimo, Vasco Vondrášek, Jiří Satagopam, Venkata P Schneider, Reinhard Trefois, Christophe Ollert, Markus Gigascience Technical Note BACKGROUND: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. The analysis of that amount of high-dimensional data becomes demanding in both hardware and software of high-performance computational resources. Current software tools often do not scale to the datasets of such size; users are thus forced to downsample the data to bearable sizes, in turn losing accuracy and ability to detect many underlying complex phenomena. RESULTS: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study. CONCLUSIONS: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-art software. Measurements indicate that the performance scales to much larger datasets. 
The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies. Oxford University Press 2020-11-18 /pmc/articles/PMC7672468/ /pubmed/33205814 http://dx.doi.org/10.1093/gigascience/giaa127 Text en © The Author(s) 2020. Published by Oxford University Press GigaScience. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Technical Note
Kratochvíl, Miroslav
Hunewald, Oliver
Heirendt, Laurent
Verissimo, Vasco
Vondrášek, Jiří
Satagopam, Venkata P
Schneider, Reinhard
Trefois, Christophe
Ollert, Markus
GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title_full GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title_fullStr GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title_full_unstemmed GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title_short GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets
title_sort gigasom.jl: high-performance clustering and visualization of huge cytometry datasets
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7672468/
https://www.ncbi.nlm.nih.gov/pubmed/33205814
http://dx.doi.org/10.1093/gigascience/giaa127
work_keys_str_mv AT kratochvilmiroslav gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT hunewaldoliver gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT heirendtlaurent gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT verissimovasco gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT vondrasekjiri gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT satagopamvenkatap gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT schneiderreinhard gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT trefoischristophe gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets
AT ollertmarkus gigasomjlhighperformanceclusteringandvisualizationofhugecytometrydatasets