Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data

Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order...

Bibliographic Details
Main Authors: Dousti Mousavi, Niloufar, Aldirawi, Hani, Yang, Jie
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10443356/
https://www.ncbi.nlm.nih.gov/pubmed/37606439
http://dx.doi.org/10.3390/biotech12030052
author Dousti Mousavi, Niloufar
Aldirawi, Hani
Yang, Jie
collection PubMed
description Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order selection for response categories, and variable selection. We perform our procedure on high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and zero predictive error rate based on a five-fold cross-validation; and two other models with 31 and 4 genes, respectively, are recommended for prognostic multi-gene signatures.
format Online
Article
Text
id pubmed-10443356
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10443356 2023-08-23 BioTech (Basel) Article MDPI 2023-07-27 /pmc/articles/PMC10443356/ /pubmed/37606439 http://dx.doi.org/10.3390/biotech12030052 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data
topic Article