
EnsCat: clustering of categorical data via ensembling

BACKGROUND: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides....


Bibliographic Details
Main Authors:	Clarke, Bertrand S., Amiri, Saeid, Clarke, Jennifer L.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5025633/ https://www.ncbi.nlm.nih.gov/pubmed/27634377 http://dx.doi.org/10.1186/s12859-016-1245-9

author	Clarke, Bertrand S. Amiri, Saeid Clarke, Jennifer L.
author_sort	Clarke, Bertrand S.
collection	PubMed
description	BACKGROUND: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: distance metrics are often not well defined on categorical data; running time for computations on high-dimensional data can be considerable; and the Curse of Dimensionality often impedes interpretation of the results. To date, the literature and software addressing clustering of categorical data have not led to a standard approach. RESULTS: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that, in many settings, an ensemble generally outperforms the components that comprise it. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces, likewise of variable size, to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance. CONCLUSIONS: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat.
format	Online Article Text
id	pubmed-5025633
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5025633 2016-09-20 EnsCat: clustering of categorical data via ensembling Clarke, Bertrand S. Amiri, Saeid Clarke, Jennifer L. BMC Bioinformatics Software
BioMed Central 2016-09-15 /pmc/articles/PMC5025633/ /pubmed/27634377 http://dx.doi.org/10.1186/s12859-016-1245-9 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	EnsCat: clustering of categorical data via ensembling
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5025633/ https://www.ncbi.nlm.nih.gov/pubmed/27634377 http://dx.doi.org/10.1186/s12859-016-1245-9
work_keys_str_mv	AT clarkebertrands enscatclusteringofcategoricaldataviaensembling AT amirisaeid enscatclusteringofcategoricaldataviaensembling AT clarkejenniferl enscatclusteringofcategoricaldataviaensembling
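
The pipeline outlined in the description (many hierarchical clusterings of varying size, optional random subspaces, a membership matrix, then a Hamming-based ensembled dissimilarity) can be sketched roughly as follows. This is an illustrative Python sketch and not the EnsCat R code: the function names, the choice of average linkage, the way cluster counts are cycled, and the toy sequences are all assumptions made for the example.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(rows):
    """Symmetric matrix of pairwise Hamming distances between rows."""
    n = len(rows)
    d = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = hamming(rows[i], rows[j])
    return d

def average_linkage(dist, k):
    """Agglomerative clustering (average linkage, an assumption here):
    merge the closest pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            link = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            link /= len(clusters[a]) * len(clusters[b])
            if best is None or link < best[0]:
                best = (link, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def ensemble_dissimilarity(data, n_members=10, k_values=(2, 3),
                           subspace=False, seed=0):
    """Build the membership matrix (one column per base clustering) and
    return the Hamming distances between its rows."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    membership = [[] for _ in data]
    for m in range(n_members):
        k = k_values[m % len(k_values)]  # vary the number of clusters
        if subspace:
            # Hypothetical subspace step: random columns, random subset size.
            cols = rng.sample(range(n_cols), rng.randint(1, n_cols))
        else:
            cols = list(range(n_cols))
        rows = [[r[c] for c in cols] for r in data]
        labels = average_linkage(distance_matrix(rows), k)
        for i, lab in enumerate(labels):
            membership[i].append(lab)
    return distance_matrix(membership)

# Toy nucleotide sequences (made up for illustration): two obvious groups.
data = ["AAAAAA", "AAAAAT", "AAAATA", "GGGGGG", "GGGGGC", "GGGGCG"]
diss = ensemble_dissimilarity(data)
final = average_linkage(diss, 2)
print(final)  # [0, 0, 0, 1, 1, 1]
```

Two observations are far apart in the ensembled matrix when the base clusterings rarely place them in the same cluster, so pairs that are separated under every cluster count receive the maximum dissimilarity. Setting `subspace=True` adds the random-subspace step intended for high-dimensional data; its results depend on the seed.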
