Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease

Bibliographic Details
Main authors: Lee, Youngro; Cappellato, Marco; Di Camillo, Barbara
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/
https://www.ncbi.nlm.nih.gov/pubmed/37882604
http://dx.doi.org/10.1093/gigascience/giad083
author Lee, Youngro
Cappellato, Marco
Di Camillo, Barbara
author_sort Lee, Youngro
collection PubMed
description BACKGROUND: Biomarker discovery that exploits the feature importance of machine learning models has recently gained ground in the microbiome field, owing to its high predictive performance in several disease states. To obtain a concrete selection from a large number of features, recursive feature elimination (RFE) has been widely used in bioinformatics. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggest methods to improve stability while sustaining performance. RESULTS: We used abundance matrices of the gut microbiome (283 taxa at the species level and 220 at the genus level) to distinguish patients with inflammatory bowel disease (IBD) from healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that yield a greater improvement in stability without sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the number of common features, and the ability to filter out noise features. We confirmed that mapping by the Bray–Curtis similarity matrix before RFE consistently improves stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when only a limited number of biomarkers was used as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance.
Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. CONCLUSION: Taken together, our work not only shows how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provides useful insights for future comparative studies.
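The description above mentions mapping the abundance data by a Bray–Curtis similarity matrix before RFE and measuring how stable the selected features are across bootstraps. The following is a minimal sketch of those two ideas, not the authors' implementation. It assumes that the mapping multiplies the samples-by-taxa abundance table by a taxon-by-taxon similarity matrix, and it uses the mean pairwise Jaccard index as one possible stability measure; neither detail is stated in this record. All function names are hypothetical.

```python
from itertools import combinations


def bray_curtis_similarity(u, v):
    """Return 1 minus the Bray-Curtis dissimilarity of two non-negative vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 1.0  # assumption: two all-zero vectors count as identical
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 2.0 * shared / total


def taxon_similarity_matrix(abundance):
    """Pairwise similarity between the columns (taxa) of a samples x taxa table."""
    columns = list(zip(*abundance))
    return [[bray_curtis_similarity(ci, cj) for cj in columns] for ci in columns]


def map_by_similarity(abundance, similarity):
    """Assumed mapping: (samples x taxa) table times (taxa x taxa) similarity."""
    sim_columns = list(zip(*similarity))
    return [[sum(x * s for x, s in zip(row, col)) for col in sim_columns]
            for row in abundance]


def mean_jaccard(feature_sets):
    """Mean pairwise Jaccard index; needs at least two non-empty feature sets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

In such a sketch, the mapped table would be passed to the feature selection step in place of the raw abundances, and `mean_jaccard` would be applied to the sets of taxa selected in each bootstrap run; values closer to 1 indicate more stable selection.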
format Online
Article
Text
id pubmed-10600917
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10600917 2023-10-27 Gigascience Research Oxford University Press 2023-10-26 /pmc/articles/PMC10600917/ /pubmed/37882604 http://dx.doi.org/10.1093/gigascience/giad083 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/
https://www.ncbi.nlm.nih.gov/pubmed/37882604
http://dx.doi.org/10.1093/gigascience/giad083