
Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data

BACKGROUND: Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn’s disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess machine learning [ML] to classif...


Bibliographic Details
Main authors: Stafford, Imogen S, Ashton, James J, Mossotto, Enrico, Cheng, Guo, Mark Beattie, Robert, Ennis, Sarah
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10637043/
https://www.ncbi.nlm.nih.gov/pubmed/37205778
http://dx.doi.org/10.1093/ecco-jcc/jjad084
_version_ 1785146475421892608
author Stafford, Imogen S
Ashton, James J
Mossotto, Enrico
Cheng, Guo
Mark Beattie, Robert
Ennis, Sarah
author_sort Stafford, Imogen S
collection PubMed
description BACKGROUND: Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn’s disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess machine learning [ML] to classify patients according to IBD subtype. METHODS: Whole exome sequencing [WES] from paediatric/adult IBD patients was processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian Optimisation, were performed [training data]. The supervised ML method random forest was utilised to classify patients as CD or UC, using three panels: 1] all available genes; 2] autoimmune genes; 3] ‘IBD’ genes. ML results were assessed using area under the receiver operating characteristics curve [AUROC], sensitivity, and specificity on the testing dataset. RESULTS: A total of 906 patients were included in analysis [600 CD, 306 UC]. Training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming an IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC. DISCUSSION: We demonstrate promising classification of patients by subtype using random forest and WES data. Focusing on specific subgroups of patients, with larger datasets, may result in better classification.
format Online
Article
Text
id pubmed-10637043
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-106370432023-11-15 Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data Stafford, Imogen S Ashton, James J Mossotto, Enrico Cheng, Guo Mark Beattie, Robert Ennis, Sarah J Crohns Colitis Original Articles BACKGROUND: Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn’s disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess machine learning [ML] to classify patients according to IBD subtype. METHODS: Whole exome sequencing [WES] from paediatric/adult IBD patients was processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian Optimisation, were performed [training data]. The supervised ML method random forest was utilised to classify patients as CD or UC, using three panels: 1] all available genes; 2] autoimmune genes; 3] ‘IBD’ genes. ML results were assessed using area under the receiver operating characteristics curve [AUROC], sensitivity, and specificity on the testing dataset. RESULTS: A total of 906 patients were included in analysis [600 CD, 306 UC]. Training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming an IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC. DISCUSSION: We demonstrate promising classification of patients by subtype using random forest and WES data. 
Focusing on specific subgroups of patients, with larger datasets, may result in better classification. Oxford University Press 2023-05-19 /pmc/articles/PMC10637043/ /pubmed/37205778 http://dx.doi.org/10.1093/ecco-jcc/jjad084 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of European Crohn’s and Colitis Organisation. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Original Articles
Stafford, Imogen S
Ashton, James J
Mossotto, Enrico
Cheng, Guo
Mark Beattie, Robert
Ennis, Sarah
Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data
title Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data
topic Original Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10637043/
https://www.ncbi.nlm.nih.gov/pubmed/37205778
http://dx.doi.org/10.1093/ecco-jcc/jjad084
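The evaluation setup described in the abstract (an 80/20 split, a training set balanced to the minority class, and AUROC, sensitivity, and specificity measured on the test set) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' pipeline: the function names, the random seed, the per-class rounding, and the choice of CD as the positive class are assumptions made for illustration. The GenePy scoring, linear support vector feature selection, Bayesian optimisation, and random forest steps are omitted, so the balanced training size produced here need not match the 488 patients reported.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hold out test_fraction of each class; return (train_idx, test_idx)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # rounding rule is an assumption
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def undersample(train_idx, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for cls in sorted(by_class):
        balanced += rng.sample(by_class[cls], n_min)
    return balanced


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in sample
    size, which is acceptable for a test set of a few hundred patients.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Cohort sizes taken from the abstract: 600 CD and 306 UC patients.
    labels = ["CD"] * 600 + ["UC"] * 306
    train_idx, test_idx = stratified_split(labels)
    balanced = undersample(train_idx, labels)
    print(len(train_idx), len(test_idx), len(balanced))

    # Toy scores standing in for classifier output (hypothetical values).
    y_true = [1, 1, 1, 0, 0]  # 1 = CD (assumed positive class), 0 = UC
    scores = [0.9, 0.4, 0.7, 0.5, 0.2]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(auroc(y_true, scores))
    print(sensitivity_specificity(y_true, y_pred))
```

Balancing is applied to the training indices only, so the test set keeps the cohort's natural CD-to-UC ratio. AUROC is computed from the raw scores, whereas sensitivity and specificity depend on the chosen decision threshold (0.5 here, an arbitrary illustrative value).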