The Performance Evaluation of The Random Forest Algorithm for A Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study

OBJECTIVE: In microarray datasets, hundreds and thousands of genes are measured in a small number of samples, and sometimes due to problems that occur during the experiment, the expression value of some genes is recorded as missing. It is a difficult task to determine the genes that cause disease or...

Bibliographic Details
Main Authors:	Rabiei, Niloofar; Soltanian, Ali Reza; Farhadian, Maryam; Bahreini, Fatemeh
Format:	Online Article Text
Language:	English
Published:	Royan Institute 2023
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10257059/ https://www.ncbi.nlm.nih.gov/pubmed/37300296 http://dx.doi.org/10.22074/CELLJ.2023.1971852.1156

author	Rabiei, Niloofar; Soltanian, Ali Reza; Farhadian, Maryam; Bahreini, Fatemeh
author_sort	Rabiei, Niloofar
collection	PubMed
description	OBJECTIVE: In microarray datasets, hundreds and thousands of genes are measured in a small number of samples, and sometimes due to problems that occur during the experiment, the expression value of some genes is recorded as missing. It is a difficult task to determine the genes that cause disease or cancer from a large number of genes. This study aimed to identify genes associated with pancreatic cancer (PC). First, the K-nearest neighbor (KNN) imputation method was used to impute missing values (MVs) in the gene expression data. Then, the random forest algorithm was used to identify the genes associated with PC. MATERIALS AND METHODS: In this retrospective study, 24 samples from the GSE14245 dataset were examined. Twelve samples were from patients with PC, and 12 samples were from healthy controls. After preprocessing and applying the fold-change technique, 29482 genes were used. We applied KNN imputation whenever a gene had MVs. Then, the genes most strongly associated with PC were selected using the random forest algorithm. We classified the dataset using support vector machine (SVM) and naïve Bayes (NB) classifiers, and the F-score and Jaccard index were reported. RESULTS: Out of the 29482 genes, 1185 genes with fold-changes greater than 3 were selected. After selecting the most associated genes, the 21 genes with the highest importance values were identified. Among these, S100P and GPX3 had the highest and lowest importance values, respectively. The F-score and Jaccard values of the SVM and NB classifiers were 95.5, 93, 92, and 92 percent, respectively. CONCLUSION: By combining the fold-change technique, KNN imputation, and the random forest algorithm, this study identified strongly associated genes that many previous studies had not reported. We therefore suggest that researchers use the random forest algorithm to detect genes related to a disease of interest.
format	Online Article Text
id	pubmed-10257059
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Royan Institute
record_format	MEDLINE/PubMed
spelling	pubmed-10257059 2023-06-11 Cell J Original Article Royan Institute 2023-05 2023-05-28 /pmc/articles/PMC10257059/ /pubmed/37300296 http://dx.doi.org/10.22074/CELLJ.2023.1971852.1156 Text en Any use, distribution, reproduction or abstract of this publication in any medium, with the exception of commercial purposes, is permitted provided the original work is properly cited. https://creativecommons.org/licenses/by-nc/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial 3.0 (CC BY-NC 3.0) License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	The Performance Evaluation of The Random Forest Algorithm for A Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10257059/ https://www.ncbi.nlm.nih.gov/pubmed/37300296 http://dx.doi.org/10.22074/CELLJ.2023.1971852.1156
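
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the toy expression values, the function names, the gene-wise neighbour search, the choice of k, and the order of the steps (impute first, then filter by fold change) are all assumptions made for illustration, and the fold change is computed on positive, non-log intensities. The random forest ranking and the SVM and NB classifiers are not shown.

```python
from math import sqrt
from statistics import mean


def knn_impute(rows, k=1):
    """Fill None entries in each gene row from the k most similar gene rows."""

    def distance(a, b):
        # Root-mean-square difference over samples observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Neighbours must have an observed value in sample j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = mean(nearest)
    return filled


def fold_change(row, case_idx, control_idx):
    """Symmetric fold change between group means (assumes positive intensities)."""
    case = mean(row[j] for j in case_idx)
    control = mean(row[j] for j in control_idx)
    return max(case / control, control / case)


def f_score_and_jaccard(y_true, y_pred, positive=1):
    """Return (F1 score, Jaccard index) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = tp + fp + fn
    if denom == 0:
        return 0.0, 0.0
    return 2 * tp / (2 * tp + fp + fn), tp / denom


# Toy matrix (hypothetical values): 4 genes x 4 samples, None marks a missing value.
expression = [
    [9.0, 10.0, 2.0, 3.0],
    [8.0, None, 2.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],
    [1.0, 1.0, 4.0, 4.0],
]
cases, controls = [0, 1], [2, 3]

imputed = knn_impute(expression, k=1)
selected = [
    g for g, row in enumerate(imputed) if fold_change(row, cases, controls) > 3
]
f1, jaccard = f_score_and_jaccard([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(selected, round(f1, 3), jaccard)
```

With true positives TP, false positives FP, and false negatives FN, the F-score is 2TP / (2TP + FP + FN) and the Jaccard index is TP / (TP + FP + FN), so the Jaccard index is never larger than the F-score for the same predictions.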
