Microbiome Preprocessing Machine Learning Pipeline

BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing ste...

Bibliographic Details
Main authors: Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Immunology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/
https://www.ncbi.nlm.nih.gov/pubmed/34220823
http://dx.doi.org/10.3389/fimmu.2021.677870
author Jasner, Yoel
Belogolovski, Anna
Ben-Itzhak, Meirav
Koren, Omry
Louzoun, Yoram
collection PubMed
description BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with a taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy, as measured by the Area Under the Curve (AUC) of the classification. RESULTS: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. CONCLUSIONS: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version (https://github.com/louzounlab/microbiome/tree/master/Preprocess) or as a service (http://mip-mlp.math.biu.ac.il/Home). Both contain the code and standard test sets.
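The preprocessing steps named in the abstract (merging features at a taxonomic level, log transform versus relative counts, z-scoring) can be illustrated with a minimal sketch. This is not the MIPMLP code or its API. The function names, the dict-of-lists table layout, the semicolon-separated taxonomy strings, and the pseudo-count value are all assumptions made for illustration. In particular, a plain sum stands in for the per-group dimension reduction step described in the abstract.

```python
import math
from statistics import mean, pstdev


def merge_by_taxonomy(counts, taxonomy, level):
    """Sum the counts of features that share the first `level` taxonomic ranks.

    Assumption: summing is a simple stand-in for the per-group dimension
    reduction mentioned in the abstract; it is not the published method.
    """
    merged = {}
    for feature, values in counts.items():
        key = ";".join(taxonomy[feature].split(";")[:level])
        if key in merged:
            merged[key] = [a + b for a, b in zip(merged[key], values)]
        else:
            merged[key] = list(values)
    return merged


def log_transform(table, pseudo_count=0.1):
    """log10(count + pseudo_count); the pseudo-count value is an assumption."""
    return {k: [math.log10(v + pseudo_count) for v in vals]
            for k, vals in table.items()}


def relative_abundance(table):
    """Divide each count by the total count of its sample (column)."""
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    totals = [sum(vals[i] for vals in table.values()) for i in range(n_samples)]
    return {k: [v / t if t else 0.0 for v, t in zip(vals, totals)]
            for k, vals in table.items()}


def z_score(table):
    """Standardize each feature across samples (zero mean, unit variance)."""
    scored = {}
    for k, vals in table.items():
        mu, sd = mean(vals), pstdev(vals)
        scored[k] = [(v - mu) / sd if sd else 0.0 for v in vals]
    return scored


# Hypothetical toy data: three features measured in three samples.
counts = {"otu1": [10, 0, 5], "otu2": [90, 20, 5], "otu3": [0, 80, 90]}
taxonomy = {
    "otu1": "k__Bacteria;p__Firmicutes;c__Clostridia",
    "otu2": "k__Bacteria;p__Firmicutes;c__Bacilli",
    "otu3": "k__Bacteria;p__Bacteroidetes;c__Bacteroidia",
}

merged = merge_by_taxonomy(counts, taxonomy, level=2)  # merge at phylum level
log_table = log_transform(merged)        # log representation
rel_table = relative_abundance(merged)   # relative-count representation
z_table = z_score(log_table)             # optional standardization
```

Either `log_table` or `rel_table` could then be passed to a classifier and compared by AUC, which is the comparison the abstract reports.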
format Online
Article
Text
id pubmed-8250139
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8250139 2021-07-03 Front Immunol Immunology Frontiers Media S.A. 2021-06-18 /pmc/articles/PMC8250139/ /pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870 Text en
Copyright © 2021 Jasner, Belogolovski, Ben-Itzhak, Koren and Louzoun. https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Microbiome Preprocessing Machine Learning Pipeline
topic Immunology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/
https://www.ncbi.nlm.nih.gov/pubmed/34220823
http://dx.doi.org/10.3389/fimmu.2021.677870