
Predicting sample size required for classification performance


Bibliographic Details
Main Authors: Figueroa, Rosa L, Zeng-Treitler, Qing, Kandula, Sasikiran, Ngo, Long H
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/
https://www.ncbi.nlm.nih.gov/pubmed/22336388
http://dx.doi.org/10.1186/1472-6947-12-8
_version_ 1782227321184321536
author Figueroa, Rosa L
Zeng-Treitler, Qing
Kandula, Sasikiran
Ngo, Long H
author_facet Figueroa, Rosa L
Zeng-Treitler, Qing
Kandula, Sasikiran
Ngo, Long H
author_sort Figueroa, Rosa L
collection PubMed
description BACKGROUND: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target. METHODS: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness of fit measures. As a control, we used an un-weighted fitting method. RESULTS: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute error and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05). CONCLUSIONS: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.
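
The METHODS passage above describes fitting an inverse power law to the early points of a learning curve with nonlinear weighted least squares, then extrapolating performance (with a confidence interval) to larger annotated sample sizes. The Python sketch below illustrates that general recipe only; the model form acc(x) ≈ a - b*x^(-c), the size-proportional weights, the starting values, and every data point are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(x, a, b, c):
    # Learning-curve model: performance rises toward the asymptote a as the
    # training-set size x grows (assumed functional form).
    return a - b * np.power(x, -c)

# Hypothetical learning-curve points: (training-set size, observed accuracy)
# measured on a small annotated sample.
sizes = np.array([20, 40, 60, 80, 100, 120, 140, 160], dtype=float)
accs  = np.array([0.62, 0.70, 0.74, 0.77, 0.79, 0.80, 0.81, 0.82])

# Weighted nonlinear least squares: curve_fit treats sigma as an inverse
# weight, so giving larger training sets a smaller sigma emphasizes the
# later, more stable points of the curve (an assumed weighting choice).
sigma = 1.0 / np.sqrt(sizes)

popt, pcov = curve_fit(inverse_power_law, sizes, accs,
                       p0=[0.9, 1.0, 0.5],  # starting guesses for a, b, c
                       sigma=sigma, maxfev=10000)
a, b, c = popt

# Extrapolate to a larger annotation budget; the rough 95% interval on the
# asymptote uses the parameter covariance and is a simplification of a full
# prediction interval.
target_n = 1000
predicted = inverse_power_law(target_n, *popt)
a_se = np.sqrt(pcov[0, 0])
print(f"Predicted accuracy at n={target_n}: {predicted:.3f}")
print(f"Asymptote a = {a:.3f} (approx. 95% CI {a - 1.96*a_se:.3f} to {a + 1.96*a_se:.3f})")

Weighting the larger-sample points more heavily is one plausible reason a weighted fit can extrapolate better than the un-weighted baseline, since those points lie closest to the region being predicted; the abstract itself only reports that the weighted method outperformed the un-weighted one.
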
format Online
Article
Text
id pubmed-3307431
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3307431 2012-03-20 Predicting sample size required for classification performance Figueroa, Rosa L Zeng-Treitler, Qing Kandula, Sasikiran Ngo, Long H BMC Med Inform Decis Mak Research Article BACKGROUND: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target. METHODS: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness of fit measures. As a control, we used an un-weighted fitting method. RESULTS: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute error and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05). CONCLUSIONS: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning. BioMed Central 2012-02-15 /pmc/articles/PMC3307431/ /pubmed/22336388 http://dx.doi.org/10.1186/1472-6947-12-8 Text en Copyright © 2012 Figueroa et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
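
The RESULTS passage reports validating the extrapolated predictions with standard goodness-of-fit measures, with mean absolute error and root mean squared error falling below 0.01 once enough annotated samples were used. As a reminder of what those two measures compute, here is a small sketch with made-up predicted and observed accuracies (the arrays are illustrative, not from the paper):

import numpy as np

# Made-up accuracies at held-out, larger training-set sizes.
observed  = np.array([0.830, 0.842, 0.851, 0.858])
predicted = np.array([0.835, 0.845, 0.849, 0.856])

mae  = np.mean(np.abs(predicted - observed))
rmse = np.sqrt(np.mean((predicted - observed) ** 2))
print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")  # the paper's target: both below 0.01
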
spellingShingle Research Article
Figueroa, Rosa L
Zeng-Treitler, Qing
Kandula, Sasikiran
Ngo, Long H
Predicting sample size required for classification performance
title Predicting sample size required for classification performance
title_full Predicting sample size required for classification performance
title_fullStr Predicting sample size required for classification performance
title_full_unstemmed Predicting sample size required for classification performance
title_short Predicting sample size required for classification performance
title_sort predicting sample size required for classification performance
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/
https://www.ncbi.nlm.nih.gov/pubmed/22336388
http://dx.doi.org/10.1186/1472-6947-12-8
work_keys_str_mv AT figueroarosal predictingsamplesizerequiredforclassificationperformance
AT zengtreitlerqing predictingsamplesizerequiredforclassificationperformance
AT kandulasasikiran predictingsamplesizerequiredforclassificationperformance
AT ngolongh predictingsamplesizerequiredforclassificationperformance