An active learning based classification strategy for the minority class problem: application to histopathology annotation
BACKGROUND: Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. Generating training data for classifiers is problematic, since only domain experts (e.g. pathologists) can correctly label ground truth data. Additionally, d...
Main Authors: | Doyle, Scott; Monaco, James; Feldman, Michael; Tomaszewski, John; Madabhushi, Anant |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3284114/ https://www.ncbi.nlm.nih.gov/pubmed/22034914 http://dx.doi.org/10.1186/1471-2105-12-424 |
_version_ | 1782224323077996544 |
---|---|
author | Doyle, Scott; Monaco, James; Feldman, Michael; Tomaszewski, John; Madabhushi, Anant |
author_facet | Doyle, Scott; Monaco, James; Feldman, Michael; Tomaszewski, John; Madabhushi, Anant |
author_sort | Doyle, Scott |
collection | PubMed |
description | BACKGROUND: Supervised classifiers for digital pathology can improve the ability of physicians to detect and diagnose diseases such as cancer. Generating training data for classifiers is problematic, since only domain experts (e.g. pathologists) can correctly label ground truth data. Additionally, digital pathology datasets suffer from the "minority class problem", an issue where exemplars from the non-target class outnumber target class exemplars, which can bias the classifier and reduce accuracy. In this paper, we develop a training strategy combining active learning (AL) with class-balancing. AL identifies unlabeled samples that are "informative" (i.e. likely to increase classifier performance) for annotation, avoiding non-informative samples. This yields high accuracy with a smaller training set size compared with random learning (RL). Previous AL methods have not explicitly accounted for the minority class problem in biomedical images. Pre-specifying a target class ratio mitigates the problem of training bias. Finally, we develop a mathematical model to predict the number of annotations (cost) required to achieve balanced training classes. In addition to predicting training cost, the model reveals the theoretical properties of AL in the context of the minority class problem. RESULTS: Using this class-balanced AL training strategy (CBAL), we build a classifier to distinguish cancer from non-cancer regions on digitized prostate histopathology. Our dataset consists of 12,000 image regions sampled from 100 biopsies (58 prostate cancer patients). We compare CBAL against: (1) unbalanced AL (UBAL), which uses AL but ignores class ratio; (2) class-balanced RL (CBRL), which uses RL with a specific class ratio; and (3) unbalanced RL (UBRL). The CBAL-trained classifier yields 2% greater accuracy and 3% higher area under the receiver operating characteristic curve (AUC) than alternatively trained classifiers. Our cost model accurately predicts the number of annotations necessary to obtain balanced classes. The accuracy of our prediction is verified by empirically observed costs. Finally, we find that over-sampling the minority class yields a marginal improvement in classifier accuracy but the improved performance comes at the expense of greater annotation cost. CONCLUSIONS: We have combined AL with class balancing to yield a general training strategy applicable to most supervised classification problems where the dataset is expensive to obtain and which suffers from the minority class problem. An intelligent training strategy is a critical component of supervised classification, but the integration of AL and intelligent choice of class ratios, as well as the application of a general cost model, will help researchers to plan the training process more quickly and effectively. |
format | Online Article Text |
id | pubmed-3284114 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3284114 2012-02-23 An active learning based classification strategy for the minority class problem: application to histopathology annotation Doyle, Scott; Monaco, James; Feldman, Michael; Tomaszewski, John; Madabhushi, Anant BMC Bioinformatics Research Article BioMed Central 2011-10-28 /pmc/articles/PMC3284114/ /pubmed/22034914 http://dx.doi.org/10.1186/1471-2105-12-424 Text en Copyright ©2011 Doyle et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Doyle, Scott Monaco, James Feldman, Michael Tomaszewski, John Madabhushi, Anant An active learning based classification strategy for the minority class problem: application to histopathology annotation |
title | An active learning based classification strategy for the minority class problem: application to histopathology annotation |
title_sort | active learning based classification strategy for the minority class problem: application to histopathology annotation |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3284114/ https://www.ncbi.nlm.nih.gov/pubmed/22034914 http://dx.doi.org/10.1186/1471-2105-12-424 |
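
The abstract in this record describes combining active learning with a pre-specified class ratio and counting the annotations (cost) needed to reach balanced classes. A minimal sketch of that idea follows. It is not the authors' implementation: the function name `select_balanced_batch`, the use of distance from a 0.5 probability as the "informative" criterion, and the rule of discarding surplus samples once a class is full are all assumptions made for illustration.

```python
def select_balanced_batch(probs, oracle, per_class):
    """Annotate the most uncertain samples first until each class holds per_class examples.

    probs     -- predicted probability of the target (cancer) class for each unlabeled sample
    oracle    -- hypothetical callable standing in for the expert; returns 0 or 1 for an index
    per_class -- number of training samples wanted from each class (a 1:1 ratio is assumed)
    """
    # Rank unlabeled samples by closeness to the 0.5 decision boundary (most uncertain first).
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    kept = {0: [], 1: []}
    cost = 0  # every call to the oracle counts as one annotation
    for i in order:
        if all(len(v) >= per_class for v in kept.values()):
            break
        label = oracle(i)
        cost += 1
        # Surplus samples from an already-full class are annotated but not added.
        if len(kept[label]) < per_class:
            kept[label].append(i)
    return kept, cost


# Toy data, invented for illustration only.
probs = [0.95, 0.49, 0.56, 0.10, 0.53, 0.42, 0.30]
labels = [1, 0, 0, 0, 0, 1, 1]
kept, cost = select_balanced_batch(probs, lambda i: labels[i], per_class=2)
print(kept, cost)
```

In this toy run one annotated sample is discarded because its class is already full, which is why the annotation cost exceeds the size of the resulting training set. That gap is the quantity a cost model of the kind the abstract mentions would try to predict.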