Microbiome Preprocessing Machine Learning Pipeline
BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing ste...
Main Authors: | Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2021 |
Subjects: | Immunology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/ https://www.ncbi.nlm.nih.gov/pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870 |
_version_ | 1783717026999566336 |
---|---|
author | Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram |
author_facet | Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram |
author_sort | Jasner, Yoel |
collection | PubMed |
description | BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification. RESULTS: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. CONCLUSIONS: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets. |
format | Online Article Text |
id | pubmed-8250139 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8250139 2021-07-03 Microbiome Preprocessing Machine Learning Pipeline. Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram. Front Immunol. Immunology. Frontiers Media S.A. 2021-06-18 /pmc/articles/PMC8250139/ /pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870 Text en Copyright © 2021 Jasner, Belogolovski, Ben-Itzhak, Koren and Louzoun. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Immunology Jasner, Yoel Belogolovski, Anna Ben-Itzhak, Meirav Koren, Omry Louzoun, Yoram Microbiome Preprocessing Machine Learning Pipeline |
title | Microbiome Preprocessing Machine Learning Pipeline |
title_full | Microbiome Preprocessing Machine Learning Pipeline |
title_fullStr | Microbiome Preprocessing Machine Learning Pipeline |
title_full_unstemmed | Microbiome Preprocessing Machine Learning Pipeline |
title_short | Microbiome Preprocessing Machine Learning Pipeline |
title_sort | microbiome preprocessing machine learning pipeline |
topic | Immunology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/ https://www.ncbi.nlm.nih.gov/pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870 |
work_keys_str_mv | AT jasneryoel microbiomepreprocessingmachinelearningpipeline AT belogolovskianna microbiomepreprocessingmachinelearningpipeline AT benitzhakmeirav microbiomepreprocessingmachinelearningpipeline AT korenomry microbiomepreprocessingmachinelearningpipeline AT louzounyoram microbiomepreprocessingmachinelearningpipeline |
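
The abstract in this record names three preprocessing steps: a log transform of the feature counts, merging features that share a taxonomy at a given level through a per-group dimension reduction, and z-scoring. The Python sketch below illustrates those steps and is not the MIPMLP code. Everything specific in it is an assumption made for illustration: the function names, the pseudocount `epsilon`, the use of the first principal component as the dimension reduction, the `;`-separated taxonomy strings, and the order in which the steps are applied.

```python
import numpy as np


def log_transform(counts, epsilon=1.0):
    """Return log10(counts + epsilon); epsilon is an assumed pseudocount."""
    return np.log10(np.asarray(counts, dtype=float) + epsilon)


def merge_by_taxonomy(values, taxa, level):
    """Merge columns whose taxonomy agrees up to `level` ranks.

    Each group of columns is centered and projected onto its first
    principal component (an assumed choice of dimension reduction).
    """
    values = np.asarray(values, dtype=float)
    groups = {}
    for col, taxon in enumerate(taxa):
        key = ";".join(taxon.split(";")[:level])
        groups.setdefault(key, []).append(col)

    names, merged = [], []
    for key, cols in groups.items():
        centered = values[:, cols] - values[:, cols].mean(axis=0)
        if len(cols) == 1:
            merged.append(centered[:, 0])
        else:
            # Rows of vt are principal axes; vt[0] is the first one.
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            merged.append(centered @ vt[0])
        names.append(key)
    return names, np.column_stack(merged)


def z_score(values):
    """Standardize each column to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (values - values.mean(axis=0)) / std


def preprocess(counts, taxa, level):
    """Apply log transform, taxonomy merge, then z-scoring (assumed order)."""
    names, merged = merge_by_taxonomy(log_transform(counts), taxa, level)
    return names, z_score(merged)


# Toy data: 4 samples x 3 features (hypothetical counts and taxonomy labels).
counts = np.array([[10, 0, 5], [100, 20, 0], [1, 3, 50], [0, 7, 9]])
taxa = [
    "k__Bacteria;p__Firmicutes;c__Clostridia",
    "k__Bacteria;p__Firmicutes;c__Bacilli",
    "k__Bacteria;p__Bacteroidetes;c__Bacteroidia",
]
names, features = preprocess(counts, taxa, level=2)
```

With `level=2` the two Firmicutes features collapse into a single column, so `features` has one column per phylum. Since the abstract reports that z-scoring has a very limited effect, that last step could reasonably be made optional in such a sketch.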