
Construction of a 26-feature gene support vector machine classifier for smoking and non-smoking lung adenocarcinoma sample classification

The present study aimed to identify the feature genes associated with smoking in lung adenocarcinoma (LAC) samples and explore the underlying mechanism. Three gene expression datasets of LAC samples were downloaded from the Gene Expression Omnibus database through pre-set criteria and the expression...


Bibliographic Details
Main authors: Yang, Lei, Sun, Lu, Wang, Wei, Xu, Hao, Li, Yi, Zhao, Jia-Ying, Liu, Da-Zhong, Wang, Fei, Zhang, Lin-You
Format: Online Article Text
Language: English
Published: D.A. Spandidos 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5783520/
https://www.ncbi.nlm.nih.gov/pubmed/29257283
http://dx.doi.org/10.3892/mmr.2017.8220
author Yang, Lei
Sun, Lu
Wang, Wei
Xu, Hao
Li, Yi
Zhao, Jia-Ying
Liu, Da-Zhong
Wang, Fei
Zhang, Lin-You
author_sort Yang, Lei
collection PubMed
description The present study aimed to identify the feature genes associated with smoking in lung adenocarcinoma (LAC) samples and to explore the underlying mechanism. Three gene expression datasets of LAC samples were downloaded from the Gene Expression Omnibus database according to pre-set criteria, and the expression data were processed using meta-analysis. Differentially expressed genes (DEGs) between LAC samples of smokers and non-smokers were identified using the limma package in R. The classification accuracy of the selected DEGs was visualized using hierarchical clustering analysis in R. A protein-protein interaction (PPI) network was constructed for the DEGs using gene interaction data from the Human Protein Reference Database. Betweenness centrality (BC) was calculated for each node in the network, and the genes with the greatest BC values were used to construct the support vector machine (SVM) classifier. The dataset GSE43458 was used as the training dataset, and the other datasets (GSE12667 and GSE10072) were used as validation datasets. The classifier was evaluated using sensitivity, specificity, positive predictive value, negative predictive value and area under the curve, calculated with the pROC package in R. The feature genes in the SVM classifier were subjected to pathway enrichment analysis using Fisher's exact test. A total of 347 genes were identified as differentially expressed between samples of smokers and non-smokers. The PPI network of the DEGs comprised 202 nodes and 300 edges. An SVM classifier comprising 26 feature genes was constructed to distinguish between smoking and non-smoking LAC samples, with prediction accuracies for the GSE43458, GSE12667 and GSE10072 datasets of 100, 100 and 94.83%, respectively. Furthermore, the 26 feature genes, which were significantly enriched in 9 overrepresented biological pathways, including extracellular matrix-receptor interaction, proteoglycans in cancer, cell adhesion molecules, p53 signaling pathway, microRNAs in cancer and apoptosis, were identified as smoking-related genes in LAC. In conclusion, an SVM classifier with a high prediction accuracy for smoking and non-smoking samples was obtained. The genes in the classifier may be feature genes associated with the development of LAC in patients who smoke.
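The evaluation measures named in the description (sensitivity, specificity, positive predictive value and negative predictive value) all derive from a binary confusion matrix. The following is a minimal sketch in Python, not the authors' code: the study reports using the pROC package in R, and the function names, the label strings "smoker" and "non-smoker", and the sample labels below are illustrative assumptions rather than data from the study. Area under the curve is omitted because it requires continuous classifier scores.

```python
def confusion_counts(y_true, y_pred, positive="smoker"):
    """Count true/false positives and negatives for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive="smoker"):
    """Return sensitivity, specificity, PPV, NPV and accuracy as fractions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # smokers correctly identified
        "specificity": ratio(tn, tn + fp),  # non-smokers correctly identified
        "ppv": ratio(tp, tp + fp),          # predicted smokers that are smokers
        "npv": ratio(tn, tn + fn),          # predicted non-smokers that are non-smokers
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up labels for illustration only (hypothetical, not study data).
y_true = ["smoker"] * 4 + ["non-smoker"] * 4
y_pred = ["smoker", "smoker", "smoker", "non-smoker",
          "non-smoker", "non-smoker", "non-smoker", "smoker"]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Treating "smoker" as the positive class is a choice made for this sketch; swapping the positive class exchanges sensitivity with specificity and PPV with NPV.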
format Online
Article
Text
id pubmed-5783520
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher D.A. Spandidos
record_format MEDLINE/PubMed
spelling pubmed-5783520 2018-02-12 Construction of a 26-feature gene support vector machine classifier for smoking and non-smoking lung adenocarcinoma sample classification Yang, Lei Sun, Lu Wang, Wei Xu, Hao Li, Yi Zhao, Jia-Ying Liu, Da-Zhong Wang, Fei Zhang, Lin-You Mol Med Rep Articles D.A. Spandidos 2018-02 2017-12-07 /pmc/articles/PMC5783520/ /pubmed/29257283 http://dx.doi.org/10.3892/mmr.2017.8220 Text en Copyright: © Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
title Construction of a 26-feature gene support vector machine classifier for smoking and non-smoking lung adenocarcinoma sample classification
topic Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5783520/
https://www.ncbi.nlm.nih.gov/pubmed/29257283
http://dx.doi.org/10.3892/mmr.2017.8220