
Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

BACKGROUND: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the...


Bibliographic Details
Main Authors:	Guo, Yu; Graber, Armin; McBurney, Robert N; Balasubramanian, Raji
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942858/ https://www.ncbi.nlm.nih.gov/pubmed/20815881 http://dx.doi.org/10.1186/1471-2105-11-447
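
The abstract describes using simulation to guide study design: generate high-dimensional two-class data at a chosen signal-to-noise level, train a classifier, and estimate how performance depends on sample size. The following is a minimal Python sketch of that general idea only. It is not the MVpower library and does not reproduce the authors' procedure. The nearest-centroid classifier used here is a simplified stand-in (related to Prediction Analysis for Microarrays, but without its centroid shrinkage), and every function name, parameter, and default value is a hypothetical choice made for illustration.

```python
import random


def simulate(n_per_class, n_features, n_informative, effect_size, rng):
    """Draw a balanced two-class Gaussian dataset (unit variance).

    Assumption for illustration: only the first n_informative features
    carry signal, as a mean shift of effect_size (in SD units) in class 1.
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect_size if (label == 1 and j < n_informative) else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Compute per-class feature means (nearest-centroid, no shrinkage)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def mean_accuracy(n_per_class, effect_size, n_features=50, n_informative=5,
                  n_test_per_class=50, n_reps=20, seed=0):
    """Average held-out accuracy over n_reps simulated train/test datasets."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_reps):
        X_train, y_train = simulate(n_per_class, n_features, n_informative, effect_size, rng)
        X_test, y_test = simulate(n_test_per_class, n_features, n_informative, effect_size, rng)
        centroids = fit_centroids(X_train, y_train)
        correct = sum(predict(centroids, x) == t for x, t in zip(X_test, y_test))
        accuracies.append(correct / len(y_test))
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Tabulate estimated accuracy against training sample size per class.
    for n in (5, 10, 20, 40):
        print(n, round(mean_accuracy(n, effect_size=1.0), 3))
```

Under these assumptions, repeating the loop over a grid of sample sizes and effect sizes gives a rough picture of how many subjects are needed to reach a target accuracy. Skewed feature distributions, class imbalance, or other performance metrics would require changes to `simulate` and to the accuracy calculation.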

_version_	1782186974131519488
author	Guo, Yu Graber, Armin McBurney, Robert N Balasubramanian, Raji
author_sort	Guo, Yu
collection	PubMed
description	BACKGROUND: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. RESULTS: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. CONCLUSION: No single classifier had optimal performance under all settings. 
Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
format	Text
id	pubmed-2942858
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-29428582010-10-01 Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms Guo, Yu Graber, Armin McBurney, Robert N Balasubramanian, Raji BMC Bioinformatics Research Article BACKGROUND: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. RESULTS: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. 
We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. CONCLUSION: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data. BioMed Central 2010-09-03 /pmc/articles/PMC2942858/ /pubmed/20815881 http://dx.doi.org/10.1186/1471-2105-11-447 Text en Copyright ©2010 Guo et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2942858/ https://www.ncbi.nlm.nih.gov/pubmed/20815881 http://dx.doi.org/10.1186/1471-2105-11-447
