
Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease

BACKGROUND: Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been...


Bibliographic Details
Main authors:	Lee, Youngro, Cappellato, Marco, Di Camillo, Barbara
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/ https://www.ncbi.nlm.nih.gov/pubmed/37882604 http://dx.doi.org/10.1093/gigascience/giad083

author	Lee, Youngro Cappellato, Marco Di Camillo, Barbara
collection	PubMed
description	BACKGROUND: Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. RESULTS: We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. CONCLUSION: Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
format	Online Article Text
id	pubmed-10600917
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10600917 2023-10-27 Lee, Youngro Cappellato, Marco Di Camillo, Barbara Gigascience Research Oxford University Press 2023-10-26 /pmc/articles/PMC10600917/ /pubmed/37882604 http://dx.doi.org/10.1093/gigascience/giad083 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600917/ https://www.ncbi.nlm.nih.gov/pubmed/37882604 http://dx.doi.org/10.1093/gigascience/giad083
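
The abstract in this record describes mapping the abundance matrix by a Bray–Curtis similarity matrix before recursive feature elimination (RFE), then judging how stable the selected taxa are across bootstrap resamples. The sketch below illustrates that idea on synthetic data only. Several things are assumptions and are not taken from the record: the mapping is read as multiplying the samples-by-taxa matrix by a taxon-by-taxon similarity matrix, stability is summarized as the mean pairwise Jaccard index, the classifier is a 50-tree random forest, and all function names are hypothetical. It is not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def bray_curtis_similarity(X):
    """Taxon-by-taxon Bray-Curtis similarity for a non-negative samples x taxa matrix."""
    cols = np.asarray(X, dtype=float).T  # one row per taxon
    num = np.abs(cols[:, None, :] - cols[None, :, :]).sum(axis=2)
    den = (cols[:, None, :] + cols[None, :, :]).sum(axis=2)
    # Pairs of all-zero taxa are treated as identical (dissimilarity 0).
    dissim = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 1.0 - dissim


def select_features(X, y, n_keep, seed):
    """Run RFE with a random forest and return the indices of the kept features."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    rfe = RFE(model, n_features_to_select=n_keep, step=0.2)
    rfe.fit(X, y)
    return frozenset(np.flatnonzero(rfe.support_).tolist())


def mean_jaccard(sets):
    """Mean pairwise Jaccard index between selected feature sets (needs >= 2 sets)."""
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))


def stability_demo(n_samples=80, n_taxa=20, n_keep=5, n_boot=5, seed=0):
    """Compare selection stability with and without the similarity mapping."""
    rng = np.random.default_rng(seed)
    # Synthetic non-negative "abundances"; only the first two taxa carry signal.
    X = rng.gamma(shape=2.0, scale=1.0, size=(n_samples, n_taxa))
    signal = X[:, 0] + X[:, 1]
    y = (signal > np.median(signal)).astype(int)

    raw_sets, mapped_sets = [], []
    for b in range(n_boot):
        idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap resample
        Xb, yb = X[idx], y[idx]
        raw_sets.append(select_features(Xb, yb, n_keep, seed=b))
        # Assumed form of the mapping: project samples through the similarity matrix.
        Xb_mapped = Xb @ bray_curtis_similarity(Xb)
        mapped_sets.append(select_features(Xb_mapped, yb, n_keep, seed=b))
    return mean_jaccard(raw_sets), mean_jaccard(mapped_sets)


if __name__ == "__main__":
    raw, mapped = stability_demo()
    print(f"mean Jaccard without mapping: {raw:.2f}")
    print(f"mean Jaccard with mapping:    {mapped:.2f}")
```

Computing the similarity matrix inside each bootstrap keeps the mapping from seeing samples outside the resample. On toy data like this the two stability values need not differ in any particular direction; the abstract's conclusion rests on the real cohort and on several stability metrics.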
