
A Machine Learning-Based Analytic Pipeline Applied to Clinical and Serum IgG Immunoproteome Data To Predict Chlamydia trachomatis Genital Tract Ascension and Incident Infection in Women

We developed a reusable and open-source machine learning (ML) pipeline that can provide an analytical framework for rigorous biomarker discovery. We implemented the ML pipeline to determine the predictive potential of clinical and immunoproteome antibody data for outcomes associated with Chlamydia t...

Bibliographic Details
Main authors: Liu, Chuwen, Mokashi, Neha Vivek, Darville, Toni, Sun, Xuejun, O’Connell, Catherine M., Hufnagel, Katrin, Waterboer, Tim, Zheng, Xiaojing
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10434056/
https://www.ncbi.nlm.nih.gov/pubmed/37318345
http://dx.doi.org/10.1128/spectrum.04689-22
_version_ 1785091793686102016
author Liu, Chuwen
Mokashi, Neha Vivek
Darville, Toni
Sun, Xuejun
O’Connell, Catherine M.
Hufnagel, Katrin
Waterboer, Tim
Zheng, Xiaojing
collection PubMed
description We developed a reusable and open-source machine learning (ML) pipeline that can provide an analytical framework for rigorous biomarker discovery. We implemented the ML pipeline to determine the predictive potential of clinical and immunoproteome antibody data for outcomes associated with Chlamydia trachomatis (Ct) infection collected from 222 cis-gender females with high Ct exposure. We compared the predictive performance of 4 ML algorithms (naive Bayes, random forest, extreme gradient boosting with linear booster [xgbLinear], and k-nearest neighbors [KNN]), screened from 215 ML methods, in combination with two different feature selection strategies, Boruta and recursive feature elimination. Recursive feature elimination performed better than Boruta in this study. In prediction of Ct ascending infection, naive Bayes yielded a slightly higher median value of area under the receiver operating characteristic curve (AUROC), 0.57 (95% confidence interval [CI], 0.54 to 0.59), than other methods and provided biological interpretability. For prediction of incident infection among women uninfected at enrollment, KNN performed slightly better than other algorithms, with a median AUROC of 0.61 (95% CI, 0.49 to 0.70). In contrast, xgbLinear and random forest had higher predictive performances, with median AUROCs of 0.63 (95% CI, 0.58 to 0.67) and 0.62 (95% CI, 0.58 to 0.64), respectively, for women infected at enrollment. Our findings suggest that clinical factors and serum anti-Ct protein IgGs are inadequate biomarkers for ascension or incident Ct infection. Nevertheless, our analysis highlights the utility of a pipeline that searches for biomarkers and evaluates prediction performance and interpretability. IMPORTANCE Biomarker discovery to aid early diagnosis and treatment using machine learning (ML) approaches is a rapidly developing area in host-microbe studies. However, lack of reproducibility and interpretability of ML-driven biomarker analysis hinders selection of robust biomarkers that can be applied in clinical practice. We thus developed a rigorous ML analytical framework and provide recommendations for enhancing reproducibility of biomarkers. We emphasize the importance of robustness in selection of ML methods, evaluation of performance, and interpretability of biomarkers. Our ML pipeline is reusable and open-source and can be used not only to identify host-pathogen interaction biomarkers but also in microbiome studies and ecological and environmental microbiology research.
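The description reports each model's performance as a median AUROC with a 95% interval. The sketch below illustrates one way such a summary could be computed; it is not the authors' pipeline. It assumes AUROC is taken as the rank-based probability that a positive case outscores a negative one, and that the interval is a plain percentile interval over bootstrap resamples. The record does not say how the published intervals were derived, and all function names here are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_aurocs(labels, scores, n_resamples=200, seed=0):
    """AUROC on resamples drawn with replacement (assumed resampling scheme)."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    indices = range(len(labels))
    values = []
    while len(values) < n_resamples:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        values.append(auroc(ys, [scores[i] for i in sample]))
    return values


def median_with_interval(values, level=0.95):
    """Median plus an approximate percentile interval of the values."""
    ordered = sorted(values)
    k = int((1.0 - level) / 2.0 * (len(ordered) - 1))
    return statistics.median(ordered), ordered[k], ordered[len(ordered) - 1 - k]


if __name__ == "__main__":
    # Toy labels and scores, invented purely for illustration.
    y = [0, 0, 1, 1, 0, 1, 0, 1]
    s = [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.5]
    med, lo, hi = median_with_interval(bootstrap_aurocs(y, s))
    print(f"median AUROC {med:.2f} (95% interval {lo:.2f} to {hi:.2f})")
```

An AUROC near 0.5 means the scores rank cases no better than chance, which is why the values of 0.57 to 0.63 in the description are read as inadequate predictive performance.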
format Online
Article
Text
id pubmed-10434056
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Society for Microbiology
record_format MEDLINE/PubMed
spelling pubmed-10434056 2023-08-18 Microbiol Spectr Research Article American Society for Microbiology 2023-06-15 /pmc/articles/PMC10434056/ /pubmed/37318345 http://dx.doi.org/10.1128/spectrum.04689-22 Text en Copyright © 2023 Liu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title A Machine Learning-Based Analytic Pipeline Applied to Clinical and Serum IgG Immunoproteome Data To Predict Chlamydia trachomatis Genital Tract Ascension and Incident Infection in Women
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10434056/
https://www.ncbi.nlm.nih.gov/pubmed/37318345
http://dx.doi.org/10.1128/spectrum.04689-22