
Helicobacter pylori (H. pylori) risk factor analysis and prevalence prediction: a machine learning-based approach

BACKGROUND: Although previous epidemiological studies have examined the potential risk factors that increase the likelihood of acquiring Helicobacter pylori infections, most of these analyses have utilized conventional statistical models, including logistic regression, and have not benefited from ad...


Bibliographic Details
Main Authors: Tran, Van; Saad, Tazmilur; Tesfaye, Mehret; Walelign, Sosina; Wordofa, Moges; Abera, Dessie; Desta, Kassu; Tsegaye, Aster; Ay, Ahmet; Taye, Bineyam
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9330977/
https://www.ncbi.nlm.nih.gov/pubmed/35902812
http://dx.doi.org/10.1186/s12879-022-07625-7
author Tran, Van
Saad, Tazmilur
Tesfaye, Mehret
Walelign, Sosina
Wordofa, Moges
Abera, Dessie
Desta, Kassu
Tsegaye, Aster
Ay, Ahmet
Taye, Bineyam
collection PubMed
description BACKGROUND: Although previous epidemiological studies have examined potential risk factors for acquiring Helicobacter pylori infection, most of these analyses have used conventional statistical models, such as logistic regression, and have not benefited from advanced machine learning techniques.
OBJECTIVE: We examined H. pylori infection risk factors among school children using machine learning algorithms, both to identify important risk factors and to determine whether machine learning can predict H. pylori infection status.
METHODS: We applied feature selection and classification algorithms to data from a school-based cross-sectional survey in Ethiopia. The data set included 954 school children with 27 sociodemographic and lifestyle variables. We conducted five runs of tenfold cross-validation on the data and combined the results of these runs for each combination of feature selection algorithm (e.g., Information Gain) and classification algorithm (e.g., Support Vector Machines).
RESULTS: The XGBoost classifier predicted H. pylori infection status with the highest accuracy, 77%, a 13-percentage-point improvement over the baseline accuracy of always guessing the most frequent class (64% of the samples were H. pylori negative). K-Nearest Neighbors performed worst among all classifiers. Similar performance was observed using the F1-score and the area under the receiver operating characteristic curve (AUROC) as evaluation metrics. Among all features, place of residence (with urban residence increasing risk) was the most commonly identified risk factor for H. pylori infection, regardless of the feature selection method used. Our machine learning algorithms also identified other important risk factors for H. pylori infection, such as electricity usage in the home, toilet type, and waste disposal location. Using a 75% cutoff for robustness, machine learning identified five of the eight significant features found by traditional multivariate logistic regression. When a lower robustness threshold was used, however, machine learning approaches identified more H. pylori risk factors than multivariate logistic regression and suggested risk factors not detected by logistic regression.
CONCLUSION: This study provides evidence that machine learning approaches are well positioned to uncover H. pylori infection risk factors and predict H. pylori infection status. These approaches identify similar risk factors and predict infection with accuracy comparable to logistic regression, so they could be used as an alternative method.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12879-022-07625-7.
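The evaluation design summarized in the abstract can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' code. It shows three pieces: the majority-class baseline accuracy, five shuffled runs of tenfold cross-validation over 954 samples, and a robustness filter for selected features. The function names and feature names are hypothetical, the splits are not stratified, and the reading of the "75% cutoff" as "selected in at least 75% of the splits" is an assumption, since the abstract does not define it.

```python
import random
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (train, test) index lists for n_repeats shuffled k-fold runs."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Slice the shuffled indices into n_splits nearly equal folds.
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def robust_features(selected_per_split, cutoff=0.75):
    """Features chosen in at least `cutoff` of the splits (assumed definition)."""
    n = len(selected_per_split)
    counts = Counter(f for chosen in selected_per_split for f in set(chosen))
    return sorted(f for f, c in counts.items() if c / n >= cutoff)


# Class balance from the abstract: 64% negative, 36% positive.
labels = [0] * 64 + [1] * 36
baseline = majority_baseline(labels)

# Gain of the reported 77% accuracy over the baseline, in percentage points.
gain_points = round((0.77 - baseline) * 100)

# Five runs of tenfold cross-validation over 954 samples.
splits = list(repeated_kfold(954))

# Hypothetical feature-selection output for four splits.
example_selection = [
    ["residence", "toilet_type"],
    ["residence"],
    ["residence", "toilet_type"],
    ["residence", "toilet_type", "electricity"],
]

print(baseline, gain_points, len(splits))
print(robust_features(example_selection, cutoff=0.75))
```

In a real analysis, each split would be used to fit a feature selector and a classifier on the training indices and to score them on the test indices. Lowering `cutoff` admits features that are selected less consistently, which matches the abstract's remark that a lower threshold surfaces more candidate risk factors.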
format Online
Article
Text
id pubmed-9330977
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9330977 2022-07-28 BMC Infect Dis Research
BioMed Central 2022-07-28 /pmc/articles/PMC9330977/ /pubmed/35902812 http://dx.doi.org/10.1186/s12879-022-07625-7 Text en
© The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Helicobacter pylori (H. pylori) risk factor analysis and prevalence prediction: a machine learning-based approach
topic Research