Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practic...
Main authors: | Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Microbiology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9343866/ https://www.ncbi.nlm.nih.gov/pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201 |
_version_ | 1784761086803705856 |
---|---|
author | Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred |
author_facet | Díez López, Celia; Montiel González, Diego; Vidaki, Athina; Kayser, Manfred |
author_sort | Díez López, Celia |
collection | PubMed |
description | Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits. |
format | Online Article Text |
id | pubmed-9343866 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-93438662022-08-03 Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning Díez López, Celia Montiel González, Diego Vidaki, Athina Kayser, Manfred Front Microbiol Microbiology Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. 
Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits. Frontiers Media S.A. 2022-07-19 /pmc/articles/PMC9343866/ /pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201 Text en Copyright © 2022 Díez López, Montiel González, Vidaki and Kayser. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Microbiology Díez López, Celia Montiel González, Diego Vidaki, Athina Kayser, Manfred Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title | Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title_full | Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title_fullStr | Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title_full_unstemmed | Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title_short | Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning |
title_sort | prediction of smoking habits from class-imbalanced saliva microbiome data using data augmentation and machine learning |
topic | Microbiology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9343866/ https://www.ncbi.nlm.nih.gov/pubmed/35928158 http://dx.doi.org/10.3389/fmicb.2022.886201 |
work_keys_str_mv | AT diezlopezcelia predictionofsmokinghabitsfromclassimbalancedsalivamicrobiomedatausingdataaugmentationandmachinelearning AT montielgonzalezdiego predictionofsmokinghabitsfromclassimbalancedsalivamicrobiomedatausingdataaugmentationandmachinelearning AT vidakiathina predictionofsmokinghabitsfromclassimbalancedsalivamicrobiomedatausingdataaugmentationandmachinelearning AT kaysermanfred predictionofsmokinghabitsfromclassimbalancedsalivamicrobiomedatausingdataaugmentationandmachinelearning |
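
The description above notes that unaccounted class imbalance leads to spurious prediction accuracies, and that performance was reported as a Matthews correlation coefficient (MCC). The following is a minimal sketch of why that matters, using only the class counts given in the description (175 current vs. 1,070 non-current smokers). The helper names `matthews_corrcoef` and `accuracy` and the "always predict the majority class" baseline are illustrative assumptions, not code or an experiment from the article.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention
    (assumed here) for the case where a row or column of the
    confusion matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Class counts as reported in the description.
n_current = 175        # current smokers (minority, treated as positive)
n_non_current = 1070   # non-current smokers (majority, treated as negative)

# Hypothetical trivial classifier: label every sample "non-current".
tp, fn = 0, n_current
tn, fp = n_non_current, 0

baseline_accuracy = accuracy(tp, tn, fp, fn)
baseline_mcc = matthews_corrcoef(tp, tn, fp, fn)

print(f"accuracy = {baseline_accuracy:.3f}")
print(f"MCC      = {baseline_mcc:.3f}")
```

Under these assumptions the trivial classifier is right for roughly 86% of samples while its MCC is 0, since it never identifies a single current smoker. This is the kind of misleading accuracy that an imbalance-aware metric such as MCC is meant to expose.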