
Feature weighted models to address lineage dependency in drug-resistance prediction from Mycobacterium tuberculosis genome sequences

MOTIVATION: Tuberculosis (TB) is caused by members of the Mycobacterium tuberculosis complex (MTBC), which has a strain- or lineage-based clonal population structure. The evolution of drug-resistance in the MTBC poses a threat to successful treatment and eradication of TB. Machine learning approache...


Bibliographic Details
Main authors: Billows, Nina; Phelan, Jody E; Xia, Dong; Peng, Yonghong; Clark, Taane G; Chang, Yu-Mei
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10351970/
https://www.ncbi.nlm.nih.gov/pubmed/37428143
http://dx.doi.org/10.1093/bioinformatics/btad428
author Billows, Nina
Phelan, Jody E
Xia, Dong
Peng, Yonghong
Clark, Taane G
Chang, Yu-Mei
collection PubMed
description MOTIVATION: Tuberculosis (TB) is caused by members of the Mycobacterium tuberculosis complex (MTBC), which has a strain- or lineage-based clonal population structure. The evolution of drug-resistance in the MTBC poses a threat to successful treatment and eradication of TB. Machine learning approaches are being increasingly adopted to predict drug-resistance and characterize underlying mutations from whole genome sequences. However, such approaches may not generalize well in clinical practice due to confounding from the population structure of the MTBC. RESULTS: To investigate how population structure affects machine learning prediction, we compared three different approaches to reduce lineage dependency in random forest (RF) models, including stratification, feature selection, and feature weighted models. All RF models achieved moderate-high performance (area under the ROC curve range: 0.60–0.98). First-line drugs had higher performance than second-line drugs, but it varied depending on the lineages in the training dataset. Lineage-specific models generally had higher sensitivity than global models which may be underpinned by strain-specific drug-resistance mutations or sampling effects. The application of feature weights and feature selection approaches reduced lineage dependency in the model and had comparable performance to unweighted RF models. AVAILABILITY AND IMPLEMENTATION: https://github.com/NinaMercedes/RF_lineages.
format Online
Article
Text
id pubmed-10351970
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10351970 2023-07-18 Feature weighted models to address lineage dependency in drug-resistance prediction from Mycobacterium tuberculosis genome sequences Billows, Nina Phelan, Jody E Xia, Dong Peng, Yonghong Clark, Taane G Chang, Yu-Mei Bioinformatics Original Paper MOTIVATION: Tuberculosis (TB) is caused by members of the Mycobacterium tuberculosis complex (MTBC), which has a strain- or lineage-based clonal population structure. The evolution of drug-resistance in the MTBC poses a threat to successful treatment and eradication of TB. Machine learning approaches are being increasingly adopted to predict drug-resistance and characterize underlying mutations from whole genome sequences. However, such approaches may not generalize well in clinical practice due to confounding from the population structure of the MTBC. RESULTS: To investigate how population structure affects machine learning prediction, we compared three different approaches to reduce lineage dependency in random forest (RF) models, including stratification, feature selection, and feature weighted models. All RF models achieved moderate-high performance (area under the ROC curve range: 0.60–0.98). First-line drugs had higher performance than second-line drugs, but it varied depending on the lineages in the training dataset. Lineage-specific models generally had higher sensitivity than global models which may be underpinned by strain-specific drug-resistance mutations or sampling effects. The application of feature weights and feature selection approaches reduced lineage dependency in the model and had comparable performance to unweighted RF models. AVAILABILITY AND IMPLEMENTATION: https://github.com/NinaMercedes/RF_lineages. Oxford University Press 2023-07-10 /pmc/articles/PMC10351970/ /pubmed/37428143 http://dx.doi.org/10.1093/bioinformatics/btad428 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Feature weighted models to address lineage dependency in drug-resistance prediction from Mycobacterium tuberculosis genome sequences
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10351970/
https://www.ncbi.nlm.nih.gov/pubmed/37428143
http://dx.doi.org/10.1093/bioinformatics/btad428