Microbiome Preprocessing Machine Learning Pipeline

BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing ste...

Bibliographic Details
Main authors:	Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Immunology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/ https://www.ncbi.nlm.nih.gov/pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870

author	Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram
collection	PubMed
description	BACKGROUND: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. METHODS: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification. RESULTS: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. CONCLUSIONS: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home (both contain the code and standard test sets).
format	Online Article Text
id	pubmed-8250139
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8250139 2021-07-03 Microbiome Preprocessing Machine Learning Pipeline. Jasner, Yoel; Belogolovski, Anna; Ben-Itzhak, Meirav; Koren, Omry; Louzoun, Yoram. Front Immunol. Immunology. Frontiers Media S.A. 2021-06-18. /pmc/articles/PMC8250139/ /pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870 Text en. Copyright © 2021 Jasner, Belogolovski, Ben-Itzhak, Koren and Louzoun. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Microbiome Preprocessing Machine Learning Pipeline
topic	Immunology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8250139/ https://www.ncbi.nlm.nih.gov/pubmed/34220823 http://dx.doi.org/10.3389/fimmu.2021.677870
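
The preprocessing steps named in the description field (log of the feature counts, merging features that share a taxonomy at a given level through a per-group dimension reduction, and z-scoring) can be illustrated with a minimal NumPy sketch. This is not the MIPMLP code. The function names, the pseudocount `epsilon`, the choice of the first principal component as the dimension reduction, the order of the steps, and the toy count table are all assumptions made for illustration.

```python
import numpy as np


def log_transform(counts, epsilon=0.1):
    """Return log10(count + epsilon); epsilon is a hypothetical pseudocount."""
    return np.log10(np.asarray(counts, dtype=float) + epsilon)


def first_principal_component(block):
    """Project a (samples x features) block onto its first principal axis."""
    centered = block - block.mean(axis=0)
    # Rows of vt are the principal axes of the centered block.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def merge_by_taxonomy(values, taxonomies, level):
    """Merge columns whose taxonomy strings agree on the first `level` ranks."""
    keys = [";".join(t.split(";")[:level]) for t in taxonomies]
    merged = {}
    for key in sorted(set(keys)):
        cols = [i for i, k in enumerate(keys) if k == key]
        block = values[:, cols]
        # A group with a single feature is kept unchanged.
        merged[key] = block[:, 0] if len(cols) == 1 else first_principal_component(block)
    names = list(merged)
    return names, np.column_stack([merged[n] for n in names])


def z_score(values):
    """Standardize each column to zero mean and unit standard deviation."""
    std = values.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (values - values.mean(axis=0)) / std


# Toy table (assumed data): 4 samples x 3 features with taxonomy labels.
counts = np.array([
    [10, 0, 5],
    [200, 30, 0],
    [3, 400, 50],
    [0, 7, 900],
])
taxonomies = [
    "k__Bacteria;p__Firmicutes;c__Bacilli",
    "k__Bacteria;p__Firmicutes;c__Clostridia",
    "k__Bacteria;p__Bacteroidetes;c__Bacteroidia",
]

logged = log_transform(counts)
names, merged = merge_by_taxonomy(logged, taxonomies, level=2)  # phylum level
features = z_score(merged)
print(names)
print(features.shape)
```

Merging at `level=2` collapses the two Firmicutes features into one column and leaves the Bacteroidetes feature as is, so the resulting matrix has one column per phylum. The sign of a principal component is arbitrary, which does not matter for a downstream classifier.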
