Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression

BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to i...

Bibliographic Details
Main Authors:	Dipnall, Joanna F., Pasco, Julie A., Berk, Michael, Williams, Lana J., Dodd, Seetal, Jacka, Felice N., Meyer, Denny
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4744063/ https://www.ncbi.nlm.nih.gov/pubmed/26848571 http://dx.doi.org/10.1371/journal.pone.0148195

_version_	1782414438638288896
author	Dipnall, Joanna F. Pasco, Julie A. Berk, Michael Williams, Lana J. Dodd, Seetal Jacka, Felice N. Meyer, Denny
author_facet	Dipnall, Joanna F. Pasco, Julie A. Berk, Michael Williams, Lana J. Dodd, Seetal Jacka, Felice N. Meyer, Denny
author_sort	Dipnall, Joanna F.
collection	PubMed
description	BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. METHODS: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. RESULTS: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and both the Mexican American/Hispanic group (p = 0.016) and current smokers (p<0.001). CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.
format	Online Article Text
id	pubmed-4744063
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4744063 2016-02-11 Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression Dipnall, Joanna F. Pasco, Julie A. Berk, Michael Williams, Lana J. Dodd, Seetal Jacka, Felice N. Meyer, Denny PLoS One Research Article BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. METHODS: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. RESULTS: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and both the Mexican American/Hispanic group (p = 0.016) and current smokers (p<0.001). CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. Public Library of Science 2016-02-05 /pmc/articles/PMC4744063/ /pubmed/26848571 http://dx.doi.org/10.1371/journal.pone.0148195 Text en © 2016 Dipnall et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Dipnall, Joanna F. Pasco, Julie A. Berk, Michael Williams, Lana J. Dodd, Seetal Jacka, Felice N. Meyer, Denny Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title	Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title_full	Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title_fullStr	Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title_full_unstemmed	Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title_short	Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression
title_sort	fusing data mining, machine learning and traditional statistics to detect biomarkers associated with depression
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4744063/ https://www.ncbi.nlm.nih.gov/pubmed/26848571 http://dx.doi.org/10.1371/journal.pone.0148195
work_keys_str_mv	AT dipnalljoannaf fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT pascojuliea fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT berkmichael fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT williamslanaj fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT doddseetal fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT jackafelicen fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression AT meyerdenny fusingdataminingmachinelearningandtraditionalstatisticstodetectbiomarkersassociatedwithdepression
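
Note on reading the reported odds ratios: the description field gives each biomarker's odds ratio (OR) with a 95% confidence interval. The sketch below is an illustration only, not the authors' code. It converts those published figures back to the log-odds scale and shows how an OR scales over more than one unit of a biomarker. It assumes Wald-type intervals of the form exp(beta ± 1.96 × SE) and a linear effect on the log-odds scale, neither of which the abstract confirms; the function names are hypothetical. Because the published values are rounded to two decimals, the recovered standard errors are approximate, most of all for serum glucose.

```python
import math

# Odds ratios and 95% confidence intervals exactly as printed in the abstract:
# (odds ratio, lower bound, upper bound).
REPORTED = {
    "red cell distribution width": (1.15, 1.01, 1.30),
    "serum glucose": (1.01, 1.00, 1.01),
    "total bilirubin": (0.12, 0.05, 0.28),
}

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_odds_summary(odds_ratio, ci_low, ci_high):
    """Return (beta, approximate SE) on the log-odds scale.

    Assumption: the interval is a Wald interval, exp(beta +/- Z_95 * se).
    The abstract does not state how the intervals were computed.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return beta, se


def odds_ratio_for_change(odds_ratio, delta):
    """Odds ratio implied by a change of `delta` units of the biomarker.

    Assumption: the effect is linear on the log-odds scale, so the
    per-unit odds ratio compounds multiplicatively.
    """
    return math.exp(delta * math.log(odds_ratio))


summary = {name: log_odds_summary(*values) for name, values in REPORTED.items()}

for name, (beta, se) in summary.items():
    print(f"{name}: beta = {beta:.3f}, approx. SE = {se:.3f}")
```

Under these assumptions, an OR below 1 (total bilirubin) corresponds to a negative coefficient, that is, lower odds of depression per unit increase, while `odds_ratio_for_change(1.15, 2)` gives the OR implied by a two-unit increase in red cell distribution width. The abstract does not state the measurement units, so "unit" here means whatever unit the original model used.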
