Cargando…

Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN

Effective and timely antibiotic treatment depends on accurate and rapid in silico antimicrobial-resistant (AMR) predictions. Existing statistical rule-based Mycobacterium tuberculosis (MTB) drug resistance prediction methods using bacterial genomic sequencing data often achieve varying results: high...

Descripción completa

Detalles Bibliográficos
Autores principales: Kuang, Xingyan, Wang, Fan, Hernandez, Kyle M., Zhang, Zhenyu, Grossman, Robert L.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Nature Publishing Group UK 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8844416/
https://www.ncbi.nlm.nih.gov/pubmed/35165358
http://dx.doi.org/10.1038/s41598-022-06449-4
_version_ 1784651470676688896
author Kuang, Xingyan
Wang, Fan
Hernandez, Kyle M.
Zhang, Zhenyu
Grossman, Robert L.
author_facet Kuang, Xingyan
Wang, Fan
Hernandez, Kyle M.
Zhang, Zhenyu
Grossman, Robert L.
author_sort Kuang, Xingyan
collection PubMed
description Effective and timely antibiotic treatment depends on accurate and rapid in silico antimicrobial-resistant (AMR) predictions. Existing statistical rule-based Mycobacterium tuberculosis (MTB) drug resistance prediction methods using bacterial genomic sequencing data often achieve varying results: high accuracy on some antibiotics but relatively low accuracy on others. Traditional machine learning (ML) approaches have been applied to classify drug resistance for MTB and have shown more stable performance. However, there is no study that uses deep learning architecture like Convolutional Neural Network (CNN) on a large and diverse cohort of MTB samples for AMR prediction. We developed 24 binary classifiers of MTB drug resistance status across eight anti-MTB drugs and three different ML algorithms: logistic regression, random forest and 1D CNN using a training dataset of 10,575 MTB isolates collected from 16 countries across six continents, where an extended pan-genome reference was used for detecting genetic features. Our 1D CNN architecture was designed to integrate both sequential and non-sequential features. In terms of F1-scores, 1D CNN models are our best classifiers that are also more accurate and stable than the state-of-the-art rule-based tool Mykrobe predictor (81.1 to 93.8%, 93.7 to 96.2%, 93.1 to 94.8%, 95.9 to 97.2% and 97.1 to 98.2% for ethambutol, rifampicin, pyrazinamide, isoniazid and ofloxacin respectively). We applied filter-based feature selection to find AMR relevant features. All selected variant features are AMR-related ones in CARD database. 78.8% of them are also in the catalogue of MTB mutations that were recently identified as drug resistance-associated ones by WHO. To facilitate ML model development for AMR prediction, we packaged every step into an automated pipeline and shared the source code at https://github.com/KuangXY3/MTB-AMR-classification-CNN.
format Online
Article
Text
id pubmed-8844416
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-88444162022-02-16 Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN Kuang, Xingyan Wang, Fan Hernandez, Kyle M. Zhang, Zhenyu Grossman, Robert L. Sci Rep Article Effective and timely antibiotic treatment depends on accurate and rapid in silico antimicrobial-resistant (AMR) predictions. Existing statistical rule-based Mycobacterium tuberculosis (MTB) drug resistance prediction methods using bacterial genomic sequencing data often achieve varying results: high accuracy on some antibiotics but relatively low accuracy on others. Traditional machine learning (ML) approaches have been applied to classify drug resistance for MTB and have shown more stable performance. However, there is no study that uses deep learning architecture like Convolutional Neural Network (CNN) on a large and diverse cohort of MTB samples for AMR prediction. We developed 24 binary classifiers of MTB drug resistance status across eight anti-MTB drugs and three different ML algorithms: logistic regression, random forest and 1D CNN using a training dataset of 10,575 MTB isolates collected from 16 countries across six continents, where an extended pan-genome reference was used for detecting genetic features. Our 1D CNN architecture was designed to integrate both sequential and non-sequential features. In terms of F1-scores, 1D CNN models are our best classifiers that are also more accurate and stable than the state-of-the-art rule-based tool Mykrobe predictor (81.1 to 93.8%, 93.7 to 96.2%, 93.1 to 94.8%, 95.9 to 97.2% and 97.1 to 98.2% for ethambutol, rifampicin, pyrazinamide, isoniazid and ofloxacin respectively). We applied filter-based feature selection to find AMR relevant features. All selected variant features are AMR-related ones in CARD database. 78.8% of them are also in the catalogue of MTB mutations that were recently identified as drug resistance-associated ones by WHO. To facilitate ML model development for AMR prediction, we packaged every step into an automated pipeline and shared the source code at https://github.com/KuangXY3/MTB-AMR-classification-CNN. Nature Publishing Group UK 2022-02-14 /pmc/articles/PMC8844416/ /pubmed/35165358 http://dx.doi.org/10.1038/s41598-022-06449-4 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) .
spellingShingle Article
Kuang, Xingyan
Wang, Fan
Hernandez, Kyle M.
Zhang, Zhenyu
Grossman, Robert L.
Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title_full Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title_fullStr Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title_full_unstemmed Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title_short Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN
title_sort accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and cnn
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8844416/
https://www.ncbi.nlm.nih.gov/pubmed/35165358
http://dx.doi.org/10.1038/s41598-022-06449-4
work_keys_str_mv AT kuangxingyan accurateandrapidpredictionoftuberculosisdrugresistancefromgenomesequencedatausingtraditionalmachinelearningalgorithmsandcnn
AT wangfan accurateandrapidpredictionoftuberculosisdrugresistancefromgenomesequencedatausingtraditionalmachinelearningalgorithmsandcnn
AT hernandezkylem accurateandrapidpredictionoftuberculosisdrugresistancefromgenomesequencedatausingtraditionalmachinelearningalgorithmsandcnn
AT zhangzhenyu accurateandrapidpredictionoftuberculosisdrugresistancefromgenomesequencedatausingtraditionalmachinelearningalgorithmsandcnn
AT grossmanrobertl accurateandrapidpredictionoftuberculosisdrugresistancefromgenomesequencedatausingtraditionalmachinelearningalgorithmsandcnn