Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning

Bibliographic Details
Main authors: Founta, Kyriaki, Dafou, Dimitra, Kanata, Eirini, Sklaviadis, Theodoros, Zanos, Theodoros P., Gounaris, Anastasios, Xanthopoulos, Konstantinos
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9872307/
https://www.ncbi.nlm.nih.gov/pubmed/36694130
http://dx.doi.org/10.1186/s10020-023-00603-y
author Founta, Kyriaki
Dafou, Dimitra
Kanata, Eirini
Sklaviadis, Theodoros
Zanos, Theodoros P.
Gounaris, Anastasios
Xanthopoulos, Konstantinos
collection PubMed
description BACKGROUND: Amyotrophic lateral sclerosis (ALS) is a rare progressive neurodegenerative disease that affects upper and lower motor neurons. As the molecular basis of the disease is still elusive, high-throughput sequencing technologies, combined with data mining techniques and machine learning methods, could help identify pathogenetic mechanisms. High dimensionality is a major problem when applying machine learning techniques to biomedical data, since a very large number of features is available for a limited number of samples. The aim of this study was to develop a methodology for training interpretable machine learning models that classify ALS and ALS-subtype samples using gene expression datasets.
METHODS: We performed dimensionality reduction on gene expression data using a semi-automated preprocessing and systematic gene selection procedure built on Statistically Equivalent Signature (SES), a causality-based feature selection algorithm, followed by Boosted Regression Trees (XGBoost) and Random Forest to train the machine learning classifiers. SHapley Additive exPlanations (SHAP values) were used to interpret the machine learning classifiers. The methodology was developed and tested using two distinct publicly available ALS RNA-seq datasets. We evaluated the performance of SES as a dimensionality reduction method against: (a) Least Absolute Shrinkage and Selection Operator (LASSO), and (b) Local Outlier Factor (LOF).
RESULTS: The proposed methodology achieved 85.18% accuracy for the classification of cerebellum or frontal cortex samples as C9orf72-related familial ALS, sporadic ALS or healthy samples. Importantly, the genes identified as the most determinative have also been reported as disease-associated in the ALS literature. When tested on the evaluation dataset, the methodology achieved 88.89% accuracy for the classification of sporadic ALS motor neuron samples. When LASSO was used as the feature selection method instead of SES, the accuracy of the machine learning classifiers ranged from 74.07% to 96.30%, depending on the tissue assessed, while LOF underperformed significantly (77.78% accuracy for the classification of pooled cerebellum and frontal cortex samples).
CONCLUSIONS: Using SES, we addressed the challenge of high dimensionality in gene expression data analysis, and we trained accurate machine learning ALS classifiers, specific for the gene expression patterns of different disease subtypes and tissue samples, while identifying disease-associated genes.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s10020-023-00603-y.
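The select-then-classify workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. SES is not used here: the sketch swaps in LASSO, which the abstract names only as a comparison method, and it uses scikit-learn's Random Forest without XGBoost or SHAP. The data are synthetic (make_classification), the task is binary rather than three-class, and every parameter value (alpha, sample and feature counts, number of trees) is an arbitrary assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: few samples, many "genes".
# These sizes are assumptions, not the dimensions of the ALS datasets.
X, y = make_classification(
    n_samples=100, n_features=1000, n_informative=10, n_redundant=0,
    class_sep=1.5, random_state=0,
)

# LASSO as the selector: features with non-zero coefficients are kept.
# alpha=0.05 is an arbitrary illustrative value.
selector = SelectFromModel(Lasso(alpha=0.05, max_iter=10000))

# Scaling, selection and classification are chained in one pipeline.
model = make_pipeline(
    StandardScaler(),
    selector,
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# Because selection sits inside the pipeline, it is refit on each training
# fold, so held-out samples never influence the selected feature list.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_accuracy = float(scores.mean())

# Refit on all samples to inspect how many features survive selection.
model.fit(X, y)
n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
print(f"selected {n_selected} of {X.shape[1]} features; "
      f"mean CV accuracy {mean_accuracy:.2f}")
```

Keeping the selector inside the cross-validated pipeline is the main design choice: selecting genes on the full dataset first and cross-validating afterwards would tend to inflate the reported accuracy when features far outnumber samples.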
format Online
Article
Text
id pubmed-9872307
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9872307 2023-01-25 Mol Med Research Article BioMed Central 2023-01-24 /pmc/articles/PMC9872307/ /pubmed/36694130 http://dx.doi.org/10.1186/s10020-023-00603-y Text en
© The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning
topic Research Article