
Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning

Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practic...


Bibliographic Details
Main authors: Díez López, Celia, Montiel González, Diego, Vidaki, Athina, Kayser, Manfred
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Microbiology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9343866/
https://www.ncbi.nlm.nih.gov/pubmed/35928158
http://dx.doi.org/10.3389/fmicb.2022.886201
author Díez López, Celia
Montiel González, Diego
Vidaki, Athina
Kayser, Manfred
collection PubMed
description Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
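The abstract's point that class imbalance can produce spurious accuracies can be illustrated with its own class counts (175 current vs. 1,070 non-current smokers). The following is a minimal sketch, not the authors' pipeline: it scores a hypothetical classifier that always predicts "non-current" and compares plain accuracy with the Matthews correlation coefficient (MCC). The helper names and the convention of returning 0.0 when the MCC denominator is zero are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = current smoker."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient.

    Assumption: return 0.0 when the denominator is zero, i.e. when a whole
    row or column of the confusion matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Class sizes as reported in the abstract.
n_current, n_non_current = 175, 1070
y_true = [1] * n_current + [0] * n_non_current

# Hypothetical baseline that ignores the microbiome data entirely
# and always predicts the majority class (non-current smoker).
y_majority = [0] * len(y_true)

counts = confusion_counts(y_true, y_majority)
baseline_accuracy = accuracy(*counts)
baseline_mcc = mcc(*counts)

print(f"accuracy = {baseline_accuracy:.3f}, MCC = {baseline_mcc:.3f}")
```

The majority-class baseline reaches an accuracy of 1070/1245, about 0.86, without detecting a single current smoker, while its MCC is 0. This is one reason a metric such as MCC, as reported in the abstract, is more informative than accuracy on imbalanced data.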
format Online
Article
Text
id pubmed-9343866
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9343866 2022-08-03 Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning Díez López, Celia Montiel González, Diego Vidaki, Athina Kayser, Manfred Front Microbiol Microbiology Frontiers Media S.A. 2022-07-19 /pmc/articles/PMC9343866/ /pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201 Text en
Copyright © 2022 Díez López, Montiel González, Vidaki and Kayser. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning
topic Microbiology