
Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies


Bibliographic Details
Main authors:	Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane
Format:	Online Article Text
Language:	English
Published:	Hindawi 2017
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5632994/ https://www.ncbi.nlm.nih.gov/pubmed/29312464 http://dx.doi.org/10.1155/2017/7847531

author	Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane
collection	PubMed
description	Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
id	pubmed-5632994
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-5632994 2018-01-08 Comput Math Methods Med 2017-09-24 Text en Copyright © 2017 Norbert Krautenbacher et al. This is an open access article distributed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Similar Items