Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data
Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order...
Main authors: | Dousti Mousavi, Niloufar; Aldirawi, Hani; Yang, Jie |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443356/ https://www.ncbi.nlm.nih.gov/pubmed/37606439 http://dx.doi.org/10.3390/biotech12030052 |
_version_ | 1785093813177417728 |
---|---|
author | Dousti Mousavi, Niloufar Aldirawi, Hani Yang, Jie |
author_facet | Dousti Mousavi, Niloufar Aldirawi, Hani Yang, Jie |
author_sort | Dousti Mousavi, Niloufar |
collection | PubMed |
description | Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order selection for response categories, and variable selection. We perform our procedure on high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and zero predictive error rate based on a five-fold cross-validation; and two other models with 31 and 4 genes, respectively, are recommended for prognostic multi-gene signatures. |
format | Online Article Text |
id | pubmed-10443356 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10443356 2023-08-23 Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data Dousti Mousavi, Niloufar Aldirawi, Hani Yang, Jie BioTech (Basel) Article Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order selection for response categories, and variable selection. We perform our procedure on high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and zero predictive error rate based on a five-fold cross-validation; and two other models with 31 and 4 genes, respectively, are recommended for prognostic multi-gene signatures. MDPI 2023-07-27 /pmc/articles/PMC10443356/ /pubmed/37606439 http://dx.doi.org/10.3390/biotech12030052 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Dousti Mousavi, Niloufar Aldirawi, Hani Yang, Jie Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title | Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title_full | Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title_fullStr | Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title_full_unstemmed | Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title_short | Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data |
title_sort | categorical data analysis for high-dimensional sparse gene expression data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443356/ https://www.ncbi.nlm.nih.gov/pubmed/37606439 http://dx.doi.org/10.3390/biotech12030052 |
work_keys_str_mv | AT doustimousaviniloufar categoricaldataanalysisforhighdimensionalsparsegeneexpressiondata AT aldirawihani categoricaldataanalysisforhighdimensionalsparsegeneexpressiondata AT yangjie categoricaldataanalysisforhighdimensionalsparsegeneexpressiondata |