Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning

Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practic...

Bibliographic Details
Main Authors:	Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Microbiology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9343866/ https://www.ncbi.nlm.nih.gov/pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201

author	Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred
collection	PubMed
description	Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
format	Online Article Text
id	pubmed-9343866
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9343866 2022-08-03 Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning. Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred. Front Microbiol. Microbiology. Frontiers Media S.A. 2022-07-19. /pmc/articles/PMC9343866/ /pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201 Text en. Copyright © 2022 Díez López, Montiel González, Vidaki and Kayser. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning
topic	Microbiology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9343866/ https://www.ncbi.nlm.nih.gov/pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201
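
The description field above says that unaccounted class imbalance leads to spurious prediction accuracies, and it reports performance as a Matthews correlation coefficient (MCC). The following is a minimal sketch of that point, not the authors' code. It uses only the class counts given in the record (175 current vs. 1,070 non-current smokers) and scores a hypothetical classifier that always predicts the majority class. The function names are illustrative, and returning 0.0 when the MCC denominator is zero is an assumed convention.

```python
from math import sqrt

# Class counts as stated in the record's description field.
N_CURRENT = 175        # current smokers (treated here as the positive class)
N_NON_CURRENT = 1070   # non-current smokers (negative class)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient.

    Assumption: when any marginal total is zero the coefficient is
    undefined, and this sketch returns 0.0 in that case.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical baseline: always predict "non-current smoker".
y_true = [1] * N_CURRENT + [0] * N_NON_CURRENT
y_majority = [0] * len(y_true)

counts = confusion_counts(y_true, y_majority)
baseline_accuracy = accuracy(*counts)
baseline_mcc = mcc(*counts)

print(f"accuracy = {baseline_accuracy:.3f}, MCC = {baseline_mcc:.3f}")
```

Under these assumptions the baseline is correct on 1,070 of 1,245 samples (accuracy about 0.86) while its MCC is 0, because it never identifies a current smoker. This is why a metric such as MCC is more informative than raw accuracy on data with this class ratio.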
