Supervised learning and model analysis with compositional data

Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of th...

Bibliographic Details
Main Authors: Huang, Shimeng; Ailer, Elisabeth; Kilbertus, Niki; Pfister, Niklas
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10343141/
https://www.ncbi.nlm.nih.gov/pubmed/37390111
http://dx.doi.org/10.1371/journal.pcbi.1011240
_version_ 1785072667543470080
author Huang, Shimeng
Ailer, Elisabeth
Kilbertus, Niki
Pfister, Niklas
author_facet Huang, Shimeng
Ailer, Elisabeth
Kilbertus, Niki
Pfister, Niklas
author_sort Huang, Shimeng
collection PubMed
description Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
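Note: the description above mentions kernel-based nonparametric regression on sparse compositional data. The following is a minimal illustrative sketch of that general idea, not the KernelBiome package or its API. It assumes a Gaussian kernel on square-root-transformed compositions (a Hellinger-type kernel chosen here only because it stays well defined when components are zero) and plain kernel ridge regression; all function names, the simulated data, and the values of `sigma` and `lam` are hypothetical.

```python
import numpy as np


def closure(counts):
    """Normalize non-negative count rows so that each row sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def hellinger_rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on square-root-transformed compositions (zeros allowed)."""
    sx, sy = np.sqrt(X), np.sqrt(Y)
    sq_dist = (
        (sx ** 2).sum(axis=1)[:, None]
        + (sy ** 2).sum(axis=1)[None, :]
        - 2.0 * sx @ sy.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def fit_kernel_ridge(K, y, lam=1e-2):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)


# Simulated sparse counts (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, p = 60, 5
counts = rng.poisson(lam=2.0, size=(n, p))
# Add one count to a random component per row so that no row is all zeros.
counts[np.arange(n), rng.integers(0, p, size=n)] += 1
X = closure(counts)

# Response that depends on the zero pattern and on one component.
y = (X[:, 0] == 0).astype(float) + X[:, 1]

K = hellinger_rbf_kernel(X, X, sigma=0.5)
alpha = fit_kernel_ridge(K, y, lam=1e-2)
pred = K @ alpha
train_mse = float(np.mean((pred - y) ** 2))
print("training MSE:", train_mse)
```

In practice the kernel and the regularization strength would be selected by cross-validation on held-out data rather than fixed as they are here.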
format Online
Article
Text
id pubmed-10343141
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10343141 2023-07-14 Supervised learning and model analysis with compositional data Huang, Shimeng Ailer, Elisabeth Kilbertus, Niki Pfister, Niklas PLoS Comput Biol Research Article
Public Library of Science 2023-06-30 /pmc/articles/PMC10343141/ /pubmed/37390111 http://dx.doi.org/10.1371/journal.pcbi.1011240 Text en © 2023 Huang et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Huang, Shimeng
Ailer, Elisabeth
Kilbertus, Niki
Pfister, Niklas
Supervised learning and model analysis with compositional data
title Supervised learning and model analysis with compositional data
title_full Supervised learning and model analysis with compositional data
title_fullStr Supervised learning and model analysis with compositional data
title_full_unstemmed Supervised learning and model analysis with compositional data
title_short Supervised learning and model analysis with compositional data
title_sort supervised learning and model analysis with compositional data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10343141/
https://www.ncbi.nlm.nih.gov/pubmed/37390111
http://dx.doi.org/10.1371/journal.pcbi.1011240
work_keys_str_mv AT huangshimeng supervisedlearningandmodelanalysiswithcompositionaldata
AT ailerelisabeth supervisedlearningandmodelanalysiswithcompositionaldata
AT kilbertusniki supervisedlearningandmodelanalysiswithcompositionaldata
AT pfisterniklas supervisedlearningandmodelanalysiswithcompositionaldata