Establishment and analysis of a disease risk prediction model for the systemic lupus erythematosus with random forest

Systemic lupus erythematosus (SLE) is a latent, insidious autoimmune disease, and with the development of gene sequencing in recent years, our study aims to develop a gene-based predictive model to explore the identification of SLE at the genetic level. First, gene expression datasets of SLE whole b...

Bibliographic Details
Main Authors: Chen, Huajian, Huang, Li, Jiang, Xinyue, Wang, Yue, Bian, Yan, Ma, Shumei, Liu, Xiaodong
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9667742/
https://www.ncbi.nlm.nih.gov/pubmed/36405750
http://dx.doi.org/10.3389/fimmu.2022.1025688
_version_ 1784831779302014976
author Chen, Huajian
Huang, Li
Jiang, Xinyue
Wang, Yue
Bian, Yan
Ma, Shumei
Liu, Xiaodong
author_sort Chen, Huajian
collection PubMed
description Systemic lupus erythematosus (SLE) is a latent, insidious autoimmune disease. Building on recent advances in gene sequencing, our study aims to develop a gene-based predictive model for identifying SLE at the genetic level. First, gene expression datasets of SLE whole blood samples were collected from the Gene Expression Omnibus (GEO) database. After the datasets were merged, they were divided into training and validation datasets in a ratio of 7:3: the training dataset contained 334 SLE samples and 71 healthy samples, and the validation dataset contained 143 SLE samples and 30 healthy samples. The training dataset was used to build the disease risk prediction model, and the validation dataset was used to verify the model’s identification ability. We first analyzed differentially expressed genes (DEGs) and then used Lasso and random forest (RF) to screen out six key genes (OAS3, USP18, RTP4, SPATS2L, IFI27 and OAS1), which are essential to distinguish SLE from healthy samples. Using these six key genes as inputs, we performed five iterations of 10-fold cross-validation on the RF model and selected the model with the optimal mtry. The mean values of area under the curve (AUC) and accuracy of the models were over 0.95. On the validation dataset, our model had an AUC of 0.948. On an external validation dataset (GSE99967), the model achieved an AUC of 0.810, an accuracy of 0.836, and a sensitivity of 0.921. On another external validation dataset (GSE185047), which consists entirely of SLE patients, the model yielded an SLE sensitivity of up to 0.954. The final high-throughput RF model had a mean value of AUC over 0.9, again showing good results.
In conclusion, we identified key genetic biomarkers and successfully developed a novel disease risk prediction model for SLE that can be used as a new SLE disease risk prediction aid and contribute to the identification of SLE.
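The evaluation figures quoted in the abstract (split ratio, AUC, accuracy, sensitivity) can be illustrated with a minimal standard-library Python sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration. Only the sample counts are taken from the abstract.

```python
from typing import Sequence, Tuple


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # With only one class present, no pairs exist and AUC is undefined.
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_and_sensitivity(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Accuracy and sensitivity (true positive rate) at an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, sensitivity


# Sample counts as reported in the abstract.
train_sle, train_healthy = 334, 71
valid_sle, valid_healthy = 143, 30
train_n = train_sle + train_healthy                # 405 samples
valid_n = valid_sle + valid_healthy                # 173 samples
train_fraction = train_n / (train_n + valid_n)     # 405 / 578, close to 0.7

# Hypothetical toy data (1 = SLE, 0 = healthy); not taken from the study.
toy_labels = [1, 1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6]
toy_auc = auc_score(toy_labels, toy_scores)
toy_accuracy, toy_sensitivity = accuracy_and_sensitivity(toy_labels, toy_scores)
```

Because AUC compares positives against negatives, it cannot be computed for a cohort that contains only SLE patients, which is consistent with the abstract reporting only sensitivity for GSE185047.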
format Online
Article
Text
id pubmed-9667742
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9667742 2022-11-17 Establishment and analysis of a disease risk prediction model for the systemic lupus erythematosus with random forest Chen, Huajian Huang, Li Jiang, Xinyue Wang, Yue Bian, Yan Ma, Shumei Liu, Xiaodong Front Immunol Immunology Frontiers Media S.A. 2022-11-01 /pmc/articles/PMC9667742/ /pubmed/36405750 http://dx.doi.org/10.3389/fimmu.2022.1025688 Text en Copyright © 2022 Chen, Huang, Jiang, Wang, Bian, Ma and Liu https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Establishment and analysis of a disease risk prediction model for the systemic lupus erythematosus with random forest
topic Immunology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9667742/
https://www.ncbi.nlm.nih.gov/pubmed/36405750
http://dx.doi.org/10.3389/fimmu.2022.1025688