Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes

SIMPLE SUMMARY: We developed a predictive approach using different machine learning methods to identify a number of genes that can potentially serve as novel diagnostic colon cancer biomarkers. ABSTRACT: Background: Colorectal cancer (CRC) is the third leading cause of cancer-related death and the f...

Bibliographic Details
Main authors: Koppad, Saraswati; Basava, Annappa; Nash, Katrina; Gkoutos, Georgios V.; Acharjee, Animesh
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8944988/
https://www.ncbi.nlm.nih.gov/pubmed/35336739
http://dx.doi.org/10.3390/biology11030365
author Koppad, Saraswati
Basava, Annappa
Nash, Katrina
Gkoutos, Georgios V.
Acharjee, Animesh
collection PubMed
description SIMPLE SUMMARY: We developed a predictive approach using different machine learning methods to identify a number of genes that can potentially serve as novel diagnostic colon cancer biomarkers. ABSTRACT: Background: Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most commonly diagnosed cancer worldwide. Due to a lack of diagnostic biomarkers and understanding of the underlying molecular mechanisms, CRC’s mortality rate continues to grow. CRC occurrence and progression are dynamic processes. The expression levels of specific molecules vary at various stages of CRC, rendering its early detection and diagnosis challenging and the need for identifying accurate and meaningful CRC biomarkers more pressing. The advances in high-throughput sequencing technologies have been used to explore novel gene expression, targeted treatments, and colon cancer pathogenesis. Such approaches are routinely being applied and result in large datasets whose analysis is increasingly becoming dependent on machine learning (ML) algorithms that have been demonstrated to be computationally efficient platforms for the identification of variables across such high-dimensional datasets. Methods: We developed a novel ML-based experimental design to study CRC gene associations. Six different machine learning methods were employed as classifiers to identify genes that can be used as diagnostics for CRC using gene expression and clinical datasets. The accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC) were derived to explore the differentially expressed genes (DEGs) for CRC diagnosis. Gene ontology enrichment analyses of these DEGs were performed and predicted gene signatures were linked with miRNAs.
Results: We evaluated six machine learning classification methods (AdaBoost, ExtraTrees, logistic regression, naïve Bayes classifier, random forest, and XGBoost) across different combinations of training and test sets drawn from Gene Expression Omnibus (GEO) datasets. The accuracy and the AUROC of each combination of training and test data with different algorithms were used as comparison metrics. Random forest (RF) models consistently performed better than other models. In total, 34 genes were identified and used for pathway and gene set enrichment analysis. Further mapping of the 34 genes with miRNA identified interesting miRNA hub genes. Conclusions: We identified 34 genes with high accuracy that can be used as a diagnostics panel for CRC.
format Online
Article
Text
id pubmed-8944988
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8944988 2022-03-25 Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes Koppad, Saraswati Basava, Annappa Nash, Katrina Gkoutos, Georgios V. Acharjee, Animesh Biology (Basel) Article MDPI 2022-02-25 /pmc/articles/PMC8944988/ /pubmed/35336739 http://dx.doi.org/10.3390/biology11030365 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Machine Learning-Based Identification of Colon Cancer Candidate Diagnostics Genes
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8944988/
https://www.ncbi.nlm.nih.gov/pubmed/35336739
http://dx.doi.org/10.3390/biology11030365
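
The abstract in this record names five comparison metrics for the classifiers: accuracy, sensitivity, specificity, F1 score, and AUROC. As a rough illustration of how those quantities relate to a classifier's output, the following is a minimal standard-library Python sketch. It is not the authors' code. The function names (`auroc`, `binary_metrics`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example.

```python
from typing import Dict, Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def binary_metrics(
    y_true: Sequence[int],
    y_score: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off; the record does not state one
) -> Dict[str, float]:
    """Compute the five metrics named in the abstract for a binary classifier."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "auroc": auroc(y_true, y_score),
    }


if __name__ == "__main__":
    # Toy data only: 1 = tumour sample, 0 = normal sample (hypothetical labels).
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    for name, value in binary_metrics(labels, scores).items():
        print(f"{name}: {value:.3f}")
```

AUROC is computed here from pairwise rankings rather than by tracing the ROC curve; the two definitions agree, and the pairwise form keeps the sketch short. In practice a library such as scikit-learn would normally be used for these metrics.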