Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data

Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to...

Bibliographic Details
Main authors: Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M
Format: Online Article Text
Language: English
Published: 2016
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4982549/
https://www.ncbi.nlm.nih.gov/pubmed/27524871
http://dx.doi.org/10.4172/jpb.S14-005
_version_ 1782447796835581952
author Andreev, Victor P
Gillespie, Brenda W
Helfand, Brian T
Merion, Robert M
collection PubMed
description Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect the molecular mechanisms of the disease subtypes and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist, but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach that allows comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All experiments were performed in silico. The simulated data imitated the data expected from a study of the plasma of patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better on the simulated data than the other two methods and enabled classification with a misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel.
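The simulation idea in the description can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions that the record does not state: two equally sized subtypes, independent standard-normal protein abundances (an identity correlation matrix), and "effect size" read as a mean shift in standard-deviation units. The function names (simulate_cohort, kmeans_two_clusters, misclassification_error) are hypothetical, and only a plain two-cluster k-means is shown; hierarchical and k-medoids clustering are left out.

```python
import numpy as np


def simulate_cohort(n_patients=100, n_proteins=1129, n_diff=40,
                    effect_size=1.5, seed=None):
    """Simulate a two-subtype cohort (assumed design, not the paper's).

    Every protein is standard normal; the first n_diff proteins are
    shifted by effect_size in the second subtype.
    """
    rng = np.random.default_rng(seed)
    half = n_patients // 2
    labels = np.repeat([0, 1], [half, n_patients - half])
    data = rng.standard_normal((n_patients, n_proteins))
    data[labels == 1, :n_diff] += effect_size
    return data, labels


def _nearest(data, centers):
    """Return the nearest-center index per row and the total inertia."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    return assign, d2[np.arange(len(data)), assign].sum()


def kmeans_two_clusters(data, n_init=10, n_iter=100, seed=None):
    """Lloyd's k-means with k=2; keeps the restart with lowest inertia."""
    rng = np.random.default_rng(seed)
    best_assign, best_inertia = None, np.inf
    for _ in range(n_init):
        # Start from two distinct, randomly chosen patients.
        centers = data[rng.choice(len(data), size=2, replace=False)]
        for _ in range(n_iter):
            assign, _ = _nearest(data, centers)
            if np.bincount(assign, minlength=2).min() == 0:
                break  # one cluster emptied; give up on this restart
            new_centers = np.stack(
                [data[assign == k].mean(axis=0) for k in (0, 1)]
            )
            if np.allclose(new_centers, centers):
                break  # converged
            centers = new_centers
        assign, inertia = _nearest(data, centers)
        if inertia < best_inertia:
            best_assign, best_inertia = assign, inertia
    return best_assign


def misclassification_error(assign, labels):
    """Fraction misassigned, minimized over the two cluster labelings."""
    mismatch = float(np.mean(assign != labels))
    return min(mismatch, 1.0 - mismatch)


if __name__ == "__main__":
    # Average the error over a few simulated cohorts per sample size.
    for n in (50, 100, 200):
        errors = []
        for rep in range(5):
            data, labels = simulate_cohort(n_patients=n, seed=rep)
            assign = kmeans_two_clusters(data, seed=rep)
            errors.append(misclassification_error(assign, labels))
        print(n, round(float(np.mean(errors)), 3))
```

Repeating the loop over a grid of cohort sizes, effect sizes, and numbers of differentially abundant proteins gives a rough error curve from which a sample size could be read off. The numbers it prints depend on the assumptions above and should not be expected to reproduce the published results.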
format Online
Article
Text
id pubmed-4982549
institution National Center for Biotechnology Information
language English
publishDate 2016
record_format MEDLINE/PubMed
spelling pubmed-4982549 2016-08-12 Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data Andreev, Victor P Gillespie, Brenda W Helfand, Brian T Merion, Robert M J Proteomics Bioinform Article
2016-05-16 2016 /pmc/articles/PMC4982549/ /pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005 Text en http://creativecommons.org/licenses/by-nc/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data
topic Article