Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data
Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to...
Main authors: | Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4982549/ https://www.ncbi.nlm.nih.gov/pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005 |
_version_ | 1782447796835581952 |
---|---|
author | Andreev, Victor P Gillespie, Brenda W Helfand, Brian T Merion, Robert M |
author_facet | Andreev, Victor P Gillespie, Brenda W Helfand, Brian T Merion, Robert M |
author_sort | Andreev, Victor P |
collection | PubMed |
description | Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimating the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from a study of plasma from patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel. |
format | Online Article Text |
id | pubmed-4982549 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
record_format | MEDLINE/PubMed |
spelling | pubmed-4982549 2016-08-12 Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data Andreev, Victor P Gillespie, Brenda W Helfand, Brian T Merion, Robert M J Proteomics Bioinform Article Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimating the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from a study of plasma from patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel.
2016-05-16 2016 /pmc/articles/PMC4982549/ /pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005 Text en http://creativecommons.org/licenses/by-nc/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Article Andreev, Victor P Gillespie, Brenda W Helfand, Brian T Merion, Robert M Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title | Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title_full | Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title_fullStr | Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title_full_unstemmed | Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title_short | Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data |
title_sort | misclassification errors in unsupervised classification methods. comparison based on the simulation of targeted proteomics data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4982549/ https://www.ncbi.nlm.nih.gov/pubmed/27524871 http://dx.doi.org/10.4172/jpb.S14-005 |
work_keys_str_mv | AT andreevvictorp misclassificationerrorsinunsupervisedclassificationmethodscomparisonbasedonthesimulationoftargetedproteomicsdata AT gillespiebrendaw misclassificationerrorsinunsupervisedclassificationmethodscomparisonbasedonthesimulationoftargetedproteomicsdata AT helfandbriant misclassificationerrorsinunsupervisedclassificationmethodscomparisonbasedonthesimulationoftargetedproteomicsdata AT merionrobertm misclassificationerrorsinunsupervisedclassificationmethodscomparisonbasedonthesimulationoftargetedproteomicsdata |
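
The simulation approach summarized in the description field can be illustrated with a minimal sketch. This is not the authors' code: it assumes two equally sized patient subgroups, independent standard-normal protein abundances (the correlation matrix mentioned in the abstract is omitted), an effect size expressed as a mean shift in standard-deviation units, and a plain Lloyd's k-means with k = 2 and a single random start. The function names `simulate`, `kmeans2`, and `misclassification_error` are hypothetical and chosen only for illustration.

```python
import random


def simulate(n_per_group, n_proteins, n_diff, effect_size, rng):
    """Simulate two subgroups of patients (assumption: independent N(0, 1) proteins).

    The first n_diff proteins are shifted by effect_size in the second subgroup.
    """
    data, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
            if group == 1:
                for j in range(n_diff):
                    row[j] += effect_size
            data.append(row)
            labels.append(group)
    return data, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two abundance profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans2(data, rng, n_iter=50):
    """Lloyd's algorithm with k = 2, initialized from two randomly chosen samples."""
    centers = rng.sample(data, 2)
    assign = None
    for _ in range(n_iter):
        new_assign = [
            0 if sq_dist(row, centers[0]) <= sq_dist(row, centers[1]) else 1
            for row in data
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for k in (0, 1):
            members = [row for row, a in zip(data, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def misclassification_error(true_labels, cluster_labels):
    """Fraction misclassified under the better of the two cluster-to-group matchings."""
    n = len(true_labels)
    mismatches = sum(t != c for t, c in zip(true_labels, cluster_labels))
    return min(mismatches, n - mismatches) / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Settings quoted in the abstract: 100 patients, 1129 proteins,
    # 40 differentially abundant proteins, effect size 1.5.
    data, labels = simulate(50, 1129, 40, 1.5, rng)
    clusters = kmeans2(data, rng)
    print("misclassification error:", misclassification_error(labels, clusters))
```

Repeating the run over many seeds and over a grid of cohort sizes would give a rough estimate of the sample size needed to reach a target error, which is the use the abstract describes. A single random start can settle in a poor local optimum, so a more careful version would use several restarts and keep the best solution.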