
Tracking Major Sources of Water Contamination Using Machine Learning


Bibliographic Details
Main Authors: Wu, Jianyong; Song, Conghe; Dubinsky, Eric A.; Stewart, Jill R.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Microbiology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7854693/
https://www.ncbi.nlm.nih.gov/pubmed/33552026
http://dx.doi.org/10.3389/fmicb.2020.616692
author Wu, Jianyong
Song, Conghe
Dubinsky, Eric A.
Stewart, Jill R.
collection PubMed
description Current microbial source tracking techniques that rely on grab samples analyzed by individual endpoint assays are inadequate to explain microbial sources across space and time. Modeling and predicting host sources of microbial contamination could add a useful tool for watershed management. In this study, we tested and evaluated machine learning models to predict the major sources of microbial contamination in a watershed. We examined the relationship between microbial sources, land cover, weather, and hydrologic variables in a watershed in Northern California, United States. Six models were built to predict major microbial sources from land cover, weather, and hydrologic variables: K-nearest neighbors (KNN), Naïve Bayes, support vector machine (SVM), a simple neural network (NN), Random Forest, and XGBoost. The models successfully predicted microbial sources classified into two categories (human and non-human), with average accuracy ranging from 69% (Naïve Bayes) to 88% (XGBoost). The area under the receiver operating characteristic (ROC) curve (AUC) showed that XGBoost performed best (average AUC = 0.88), followed by Random Forest (average AUC = 0.84) and KNN (average AUC = 0.74). The importance index obtained from Random Forest indicated that precipitation and temperature were the two most important factors for predicting the dominant microbial source. These results suggest that machine learning models, particularly XGBoost, can predict the dominant sources of microbial contamination based on the relationship of microbial contaminants with daily weather and land cover, providing a powerful tool to understand microbial sources in water.
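The abstract compares models by two metrics, accuracy and ROC AUC, for a two-class problem (human vs. non-human). As an illustration only, the following minimal sketch shows how both metrics can be computed from labels and classifier scores. The data, the 0.5 threshold, and the function names are hypothetical and are not taken from the study; AUC is computed here through its pairwise-ranking (Mann-Whitney) interpretation, which may differ from the procedure the authors used.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a random positive outscores a random negative.

    Tied scores count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = human source, 0 = non-human) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"AUC      = {roc_auc(y_true, scores):.2f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason a study may report both.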
format Online
Article
Text
id pubmed-7854693
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7854693 2021-02-04 Tracking Major Sources of Water Contamination Using Machine Learning Wu, Jianyong Song, Conghe Dubinsky, Eric A. Stewart, Jill R. Front Microbiol Microbiology Frontiers Media S.A. 2021-01-20 /pmc/articles/PMC7854693/ /pubmed/33552026 http://dx.doi.org/10.3389/fmicb.2020.616692 Text en Copyright © 2021 Wu, Song, Dubinsky and Stewart. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Tracking Major Sources of Water Contamination Using Machine Learning
topic Microbiology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7854693/
https://www.ncbi.nlm.nih.gov/pubmed/33552026
http://dx.doi.org/10.3389/fmicb.2020.616692