Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers

Lung cancer is considered the most common and the deadliest cancer type. It is mainly of 2 types: small cell lung cancer and non-small cell lung cancer. Non-small cell lung cancer accounts for about 85% of cases, while small cell lung cancer accounts for only about 14%. Over the last decade, functional...

Bibliographic Details
Main authors: C, Lavanya; S, Pooja; Kashyap, Abhay H; Rahaman, Abdur; Niranjan, Swarna; Niranjan, Vidya
Format: Online Article Text
Language: English
Published: SAGE Publications 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/
https://www.ncbi.nlm.nih.gov/pubmed/37113644
http://dx.doi.org/10.1177/11769351231167992
_version_ 1785030313516204032
author C, Lavanya
S, Pooja
Kashyap, Abhay H
Rahaman, Abdur
Niranjan, Swarna
Niranjan, Vidya
author_facet C, Lavanya
S, Pooja
Kashyap, Abhay H
Rahaman, Abdur
Niranjan, Swarna
Niranjan, Vidya
author_sort C, Lavanya
collection PubMed
description Lung cancer is considered the most common and the deadliest cancer type. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate the rare and novel transcripts that aid in discovering genetic changes that occur in tumours due to different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering the biomarkers remains a challenge. Classification models help uncover and classify the biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Due to the limited number of features available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms: Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier; additionally, 2 ensemble algorithms were considered: XGBoost and AdaBoost. Out of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was considered the best-performing algorithm and was thus used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model’s accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC from the transcriptome analysis. The model gave a precision of 91.3% and a recall of 91% after fine tuning. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
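The description refers to "weighted metrics" without defining them. A minimal sketch, assuming they are per-class precision and recall averaged with weights proportional to each class's support (the convention behind scikit-learn's `average="weighted"` option), could look like the following. The function name `weighted_precision_recall` and the toy labels are hypothetical and are not taken from the article; only the four class names come from the description.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Return (precision, recall), each averaged over classes and
    weighted by the number of true samples in each class (its support)."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = 0.0
    recall = 0.0
    for label, n_true in support.items():
        # A class that is never predicted contributes zero precision.
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / n_true
        weight = n_true / total
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


# Hypothetical toy labels using the four classes named in the description.
y_true = ["NSCLC", "NSCLC", "SCLC", "both", "neither", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "both", "neither", "NSCLC"]
precision, recall = weighted_precision_recall(y_true, y_pred)
print(f"weighted precision = {precision:.3f}, weighted recall = {recall:.3f}")
```

Under this definition the weighted recall equals the overall fraction of correctly classified samples, so on an imbalanced dataset both numbers are dominated by the largest classes.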
format Online
Article
Text
id pubmed-10126698
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher SAGE Publications
record_format MEDLINE/PubMed
spelling pubmed-10126698 2023-04-26 Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers C, Lavanya S, Pooja Kashyap, Abhay H Rahaman, Abdur Niranjan, Swarna Niranjan, Vidya Cancer Inform Original Research Article SAGE Publications 2023-04-21 /pmc/articles/PMC10126698/ /pubmed/37113644 http://dx.doi.org/10.1177/11769351231167992 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
spellingShingle Original Research Article
C, Lavanya
S, Pooja
Kashyap, Abhay H
Rahaman, Abdur
Niranjan, Swarna
Niranjan, Vidya
Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title_full Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title_fullStr Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title_full_unstemmed Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title_short Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
title_sort novel biomarker prediction for lung cancer using random forest classifiers
topic Original Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/
https://www.ncbi.nlm.nih.gov/pubmed/37113644
http://dx.doi.org/10.1177/11769351231167992
work_keys_str_mv AT clavanya novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers
AT spooja novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers
AT kashyapabhayh novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers
AT rahamanabdur novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers
AT niranjanswarna novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers
AT niranjanvidya novelbiomarkerpredictionforlungcancerusingrandomforestclassifiers