
Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data

Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to...


Bibliographic Details
Main Authors:	Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M
Format:	Online Article Text
Language:	English
Published:	2016
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4982549/ https://www.ncbi.nlm.nih.gov/pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005

author	Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M
collection	PubMed
description	Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect the molecular mechanisms of the disease subtypes and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist, but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach that allows comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All experiments were performed in silico. The simulated data imitated the data expected from a study of the plasma of patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better on the simulated data than the other two methods and enabled classification with a misclassification error below 5% in the simulated cohort of 100 patients, based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel.
format	Online Article Text
id	pubmed-4982549
institution	National Center for Biotechnology Information
language	English
publishDate	2016
record_format	MEDLINE/PubMed
spelling	pubmed-4982549 2016-08-12 J Proteomics Bioinform Article 2016-05-16 2016 /pmc/articles/PMC4982549/ /pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005 Text en http://creativecommons.org/licenses/by-nc/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4982549/ https://www.ncbi.nlm.nih.gov/pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005
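The simulation approach summarized in the description can be illustrated with a small, dependency-free sketch. It is not the authors' code. It assumes two equal-sized subtypes, independent unit-variance Gaussian protein levels (the study also varied the correlation matrix, which is ignored here), and an effect size interpreted as a mean shift in standard-deviation units. The function names simulate_cohort, kmeans2, and misclassification_error are hypothetical, and the clustering is a plain Lloyd's k-means with random restarts rather than any particular library implementation.

```python
import random


def simulate_cohort(n_patients=100, n_proteins=1129, n_diff=40,
                    effect_size=1.5, seed=0):
    """Simulate two subtypes (assumption: equal sizes, independent proteins).

    The first n_diff proteins are shifted by effect_size in subtype 1;
    all other protein levels are standard normal noise.
    """
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_patients)]
    data = []
    for lab in labels:
        row = [rng.gauss(effect_size if (lab == 1 and j < n_diff) else 0.0, 1.0)
               for j in range(n_proteins)]
        data.append(row)
    return data, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans2(data, n_init=5, max_iter=50, seed=0):
    """Lloyd's k-means with k=2; returns the assignment with lowest inertia."""
    rng = random.Random(seed)
    best_assign, best_inertia = None, float("inf")
    for _ in range(n_init):
        # Initialize the two centers from two distinct random patients.
        centers = [list(c) for c in rng.sample(data, 2)]
        assign = None
        for _ in range(max_iter):
            new_assign = [0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1])
                          else 1 for r in data]
            if new_assign == assign:
                break  # assignments stable: converged
            assign = new_assign
            for k in (0, 1):
                members = [r for r, a in zip(data, assign) if a == k]
                if members:  # keep the old center if a cluster is empty
                    centers[k] = [sum(col) / len(members)
                                  for col in zip(*members)]
        inertia = sum(sq_dist(r, centers[a]) for r, a in zip(data, assign))
        if inertia < best_inertia:
            best_assign, best_inertia = assign, inertia
    return best_assign


def misclassification_error(true_labels, clusters):
    """Fraction misclassified, minimized over the two possible label swaps."""
    n = len(true_labels)
    mismatches = sum(t != c for t, c in zip(true_labels, clusters))
    return min(mismatches, n - mismatches) / n


if __name__ == "__main__":
    # Cohort size, panel size, signature size, and effect size from the abstract.
    data, labels = simulate_cohort()
    error = misclassification_error(labels, kmeans2(data))
    print(f"misclassification error: {error:.3f}")
```

Repeating the run over many seeds and over a grid of cohort sizes gives an empirical error curve from which a required sample size can be read off. Because of the simplifying assumptions above, the error values from this sketch should not be expected to match those reported in the article.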
