Cargando…

Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines

BACKGROUND: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. METHODS: The impact of three balancing methods and...

Descripción completa

Detalles Bibliográficos
Autores principales: Richardson, Alice M., Lidbury, Brett A.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5557531/
https://www.ncbi.nlm.nih.gov/pubmed/28806936
http://dx.doi.org/10.1186/s12911-017-0522-5
_version_ 1783257224141864960
author Richardson, Alice M.
Lidbury, Brett A.
author_facet Richardson, Alice M.
Lidbury, Brett A.
author_sort Richardson, Alice M.
collection PubMed
description BACKGROUND: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. METHODS: The impact of three balancing methods and one feature selection method is explored, to assess the ability of SVMs to classify imbalanced diagnostic pathology data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) for predictor variable selection, and data reshaping to overcome a large imbalance of negative to positive test results in relation to HBV and HCV immunoassay results, are examined. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007. RESULTS: Overall, the prediction of HCV test results by immunoassay was more accurate than for HBV immunoassay results associated with identical routine pathology predictor variable data. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive data imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized data sets, applying a RF for predictor variable selection had a small effect on the performance, which varied depending on the virus. For SMOTE, a RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and assay results for alanine aminotransferase (ALT), sodium for HBV and urea for HCV were found to have a significant impact upon laboratory diagnosis of HBV or HCV infection using an optimised SVM model. CONCLUSIONS: Laboratories looking to include machine learning via SVM as part of their decision support need to be aware that the balancing method, predictor variable selection and the virus type interact to affect the laboratory diagnosis of hepatitis virus infection with routine pathology laboratory variables in different ways depending on which combination is being studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis.
format Online
Article
Text
id pubmed-5557531
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-55575312017-08-16 Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines Richardson, Alice M. Lidbury, Brett A. BMC Med Inform Decis Mak Research Article BACKGROUND: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases. METHODS: The impact of three balancing methods and one feature selection method is explored, to assess the ability of SVMs to classify imbalanced diagnostic pathology data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) for predictor variable selection, and data reshaping to overcome a large imbalance of negative to positive test results in relation to HBV and HCV immunoassay results, are examined. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007. RESULTS: Overall, the prediction of HCV test results by immunoassay was more accurate than for HBV immunoassay results associated with identical routine pathology predictor variable data. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive data imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized data sets, applying a RF for predictor variable selection had a small effect on the performance, which varied depending on the virus. For SMOTE, a RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and assay results for alanine aminotransferase (ALT), sodium for HBV and urea for HCV were found to have a significant impact upon laboratory diagnosis of HBV or HCV infection using an optimised SVM model. CONCLUSIONS: Laboratories looking to include machine learning via SVM as part of their decision support need to be aware that the balancing method, predictor variable selection and the virus type interact to affect the laboratory diagnosis of hepatitis virus infection with routine pathology laboratory variables in different ways depending on which combination is being studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis. BioMed Central 2017-08-14 /pmc/articles/PMC5557531/ /pubmed/28806936 http://dx.doi.org/10.1186/s12911-017-0522-5 Text en © The Author(s). 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Richardson, Alice M.
Lidbury, Brett A.
Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title_full Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title_fullStr Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title_full_unstemmed Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title_short Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
title_sort enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5557531/
https://www.ncbi.nlm.nih.gov/pubmed/28806936
http://dx.doi.org/10.1186/s12911-017-0522-5
work_keys_str_mv AT richardsonalicem enhancementofhepatitisvirusimmunoassayoutcomepredictionsinimbalancedroutinepathologydatabydatabalancingandfeatureselectionbeforetheapplicationofsupportvectormachines
AT lidburybretta enhancementofhepatitisvirusimmunoassayoutcomepredictionsinimbalancedroutinepathologydatabydatabalancingandfeatureselectionbeforetheapplicationofsupportvectormachines