An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream da...
Main authors: | Zhang, Xinxin, Lee, Jimmy, Goh, Wilson Wen Bin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9156999/ https://www.ncbi.nlm.nih.gov/pubmed/35663731 http://dx.doi.org/10.1016/j.heliyon.2022.e09502 |
_version_ | 1784718553770885120 |
---|---|
author | Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin |
author_facet | Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin |
author_sort | Zhang, Xinxin |
collection | PubMed |
description | Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms: this helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps explain the model and provides directions for further improvement. |
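To make the comparison concrete, the sketch below shows global quantile normalisation, one of the baseline techniques the abstract names: every sample (column) is forced onto a shared reference distribution built from the mean value at each rank. This is a minimal illustrative implementation, not the paper's code; the `quantile_normalise` function name and the toy matrix are assumptions, and GFS and the class-specific variant are not reproduced here.

```python
import numpy as np

def quantile_normalise(X):
    """Global quantile normalisation of a genes x samples matrix.

    Each sample (column) is mapped onto a common reference
    distribution: the mean across samples at each rank.
    Illustrative sketch only (ties broken by sort order).
    """
    order = np.argsort(X, axis=0)        # indices that sort each column
    ranks = np.argsort(order, axis=0)    # rank of each value within its column
    ref = np.sort(X, axis=0).mean(axis=1)  # mean per rank = reference distribution
    return ref[ranks]                    # replace each value by its rank's reference

# Toy genes x samples matrix (3 samples, 4 genes)
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalise(X)
```

After normalisation, every column of `Xn` contains the same set of values (a permutation of the reference distribution), so sample-level distributional differences are removed entirely; class-specific quantile normalisation applies this separately within each phenotype class instead of globally.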
format | Online Article Text |
id | pubmed-9156999 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-91569992022-06-02 An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin Heliyon Research Article Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete.
Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms: this helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps explain the model and provides directions for further improvement. Elsevier 2022-05-21 /pmc/articles/PMC9156999/ /pubmed/35663731 http://dx.doi.org/10.1016/j.heliyon.2022.e09502 Text en © 2022 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Research Article Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title | An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title_full | An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title_fullStr | An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title_full_unstemmed | An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title_short | An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
title_sort | investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9156999/ https://www.ncbi.nlm.nih.gov/pubmed/35663731 http://dx.doi.org/10.1016/j.heliyon.2022.e09502 |
work_keys_str_mv | AT zhangxinxin aninvestigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy AT leejimmy aninvestigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy AT gohwilsonwenbin aninvestigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy AT zhangxinxin investigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy AT leejimmy investigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy AT gohwilsonwenbin investigationofhownormalisationandlocalmodellingtechniquesconfoundmachinelearningperformanceinamentalhealthstudy |