
Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach

BACKGROUND: Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints,...

Bibliographic Details
Main Authors:	Liu, Xinyu; Wang, Yupeng; Sriram, TN
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4071155/ https://www.ncbi.nlm.nih.gov/pubmed/24930009 http://dx.doi.org/10.1186/1471-2105-15-190

author	Liu, Xinyu; Wang, Yupeng; Sriram, TN
collection	PubMed
description	BACKGROUND: Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his or her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective. RESULTS: For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, and the accuracy of these approximations is validated using Monte Carlo simulations. We give a sample size determination method, which ensures that the difference between the approximate AUCs (or VUSs) of the two classifiers is below a pre-specified threshold. The performance of our sample size determination method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases ranging from well-separated populations to poorly separated ones. CONCLUSION: For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package “SampleSizeSNP” is given in Additional file 1, and a ZIP file of the R package “SampleSizeSNP” is given in Additional file 2.
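As an illustration of the AUC and VUS measures named in the description, the following is a minimal Python sketch of their empirical (rank-based) counterparts. It assumes each class is summarized by a single scalar score and that higher-numbered classes should receive higher scores; this ordinal definition is only one common way to define VUS and is not necessarily the estimator used in the article or in the “SampleSizeSNP” package. The function names `empirical_auc` and `empirical_vus` are hypothetical.

```python
from itertools import product


def empirical_auc(scores_a, scores_b):
    """Fraction of (a, b) pairs with a < b; ties count as one half.

    Assumption: class B should score higher than class A.
    """
    total = 0.0
    for a, b in product(scores_a, scores_b):
        if a < b:
            total += 1.0
        elif a == b:
            total += 0.5
    return total / (len(scores_a) * len(scores_b))


def empirical_vus(*class_scores):
    """Fraction of tuples (one score drawn from each class) that are
    in strictly increasing order across the classes.

    Assumption: classes are passed in the order their scores should
    increase. Ties are counted as incorrectly ordered here.
    """
    n_tuples = 1
    for scores in class_scores:
        n_tuples *= len(scores)
    ordered = sum(
        all(t[i] < t[i + 1] for i in range(len(t) - 1))
        for t in product(*class_scores)
    )
    return ordered / n_tuples


if __name__ == "__main__":
    # Made-up scores, used only to exercise the functions.
    print(empirical_auc([1, 3], [2, 4]))
    print(empirical_vus([1, 4], [2, 5], [3, 6]))
```

Under this ordinal definition, a classifier that scores at random has an expected AUC of 1/2 for two classes and an expected VUS of 1/6 for three classes, since only one of the 3! orderings is correct. The exhaustive enumeration above grows with the product of the class sizes, so it is suitable only for small illustrative samples.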
format	Online Article Text
id	pubmed-4071155
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4071155 2014-06-27 BMC Bioinformatics Methodology Article BioMed Central 2014-06-14 /pmc/articles/PMC4071155/ /pubmed/24930009 http://dx.doi.org/10.1186/1471-2105-15-190 Text en Copyright © 2014 Liu et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
title	Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach
topic	Methodology Article