
Synthetic minority oversampling of vital statistics data with generative adversarial networks

OBJECTIVE: Minority oversampling is a standard approach used for adjusting the ratio between the classes on imbalanced data. However, established methods often provide modest improvements in classification performance when applied to data with extremely imbalanced class distribution and to mixed-typ...


Bibliographic Details
Main authors: Koivu, Aki, Sairanen, Mikko, Airola, Antti, Pahikkala, Tapio
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Research and Applications
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7750982/
https://www.ncbi.nlm.nih.gov/pubmed/32885818
http://dx.doi.org/10.1093/jamia/ocaa127
_version_ 1783625583195848704
author Koivu, Aki
Sairanen, Mikko
Airola, Antti
Pahikkala, Tapio
collection PubMed
description OBJECTIVE: Minority oversampling is a standard approach for adjusting the class ratio in imbalanced data. However, established methods often yield only modest improvements in classification performance when applied to mixed-type data or to data with an extremely imbalanced class distribution. This is common in vital statistics data, in which the outcome incidence dictates the number of positive observations. In this article, we developed a novel neural network-based oversampling method called actGAN (activation-specific generative adversarial network) that can derive synthetic observations that are useful for increasing prediction performance in this context. MATERIALS AND METHODS: From vital statistics data, the outcome of early stillbirth was chosen to be predicted based on demographics, pregnancy history, and infections. The data contained 363 560 live births and 139 early stillbirths, resulting in a class imbalance of 99.96% to 0.04%. The hyperparameters of actGAN and of a baseline method, SMOTE-NC (Synthetic Minority Over-sampling Technique-Nominal Continuous), were tuned with Bayesian optimization, and both were compared against a cost-sensitive learning-only approach. RESULTS: While SMOTE-NC provided mixed results, actGAN consistently improved both the true positive rate at a clinically significant false positive rate and the area under the receiver-operating characteristic curve. DISCUSSION: Including an activation-specific output layer in the generator network of actGAN enables the addition of information about the underlying data structure, which outperforms the nominal mechanism of SMOTE-NC. CONCLUSIONS: actGAN improves prediction performance for our learning task. Our developed method could be applied to other mixed-type data prediction tasks that are known to be afflicted by class imbalance and limited data availability.
format Online
Article
Text
id pubmed-7750982
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7750982 2020-12-28 Synthetic minority oversampling of vital statistics data with generative adversarial networks. Koivu, Aki; Sairanen, Mikko; Airola, Antti; Pahikkala, Tapio. J Am Med Inform Assoc. Research and Applications. Oxford University Press 2020-09-04 /pmc/articles/PMC7750982/ /pubmed/32885818 http://dx.doi.org/10.1093/jamia/ocaa127 Text en © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Synthetic minority oversampling of vital statistics data with generative adversarial networks
topic Research and Applications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7750982/
https://www.ncbi.nlm.nih.gov/pubmed/32885818
http://dx.doi.org/10.1093/jamia/ocaa127
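The class proportions quoted in the abstract, and the metric it reports (true positive rate at a fixed false positive rate), can be illustrated with a short Python sketch. The two counts come from the abstract. The helper `tpr_at_fpr` and its thresholding rule are hypothetical and written only for illustration; the record does not describe how the authors computed the metric.

```python
# Counts quoted in the abstract.
live_births = 363_560
early_stillbirths = 139
total = live_births + early_stillbirths

majority_share = live_births / total
minority_share = early_stillbirths / total
print(f"{100 * majority_share:.2f}% vs {100 * minority_share:.2f}%")


def tpr_at_fpr(scores, labels, max_fpr):
    """Hypothetical helper (not from the paper).

    Flags every observation whose score is strictly above a threshold,
    where the threshold is set so that at most floor(max_fpr * n_neg)
    negatives are flagged, and returns the resulting true positive rate.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = sorted(
        (s for s, y in zip(scores, labels) if y == 0), reverse=True
    )
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    allowed_fp = int(max_fpr * len(negatives))  # floor of the FP budget
    if allowed_fp >= len(negatives):
        threshold = float("-inf")  # every observation may be flagged
    else:
        # Scores strictly above this value include at most allowed_fp negatives.
        threshold = negatives[allowed_fp]
    return sum(s > threshold for s in positives) / len(positives)
```

With toy inputs such as `tpr_at_fpr([0.9, 0.3, 0.8, 0.4, 0.2, 0.1], [1, 1, 0, 0, 0, 0], 0.25)`, one false positive is permitted out of four negatives, and the function reports the share of positives scoring above the resulting threshold.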