TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks

Bibliographic Details
Main authors: Jones, Sara, Beyers, Matthew, Shukla, Maulik, Xia, Fangfang, Brettin, Thomas, Stevens, Rick, Weil, M Ryan, Ranganathan Ganakammal, Satishkumar
Format: Online Article Text
Language: English
Published: SAGE Publications 2022
Subjects: Original Research
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9729992/
https://www.ncbi.nlm.nih.gov/pubmed/36507076
http://dx.doi.org/10.1177/11769351221139491
author Jones, Sara
Beyers, Matthew
Shukla, Maulik
Xia, Fangfang
Brettin, Thomas
Stevens, Rick
Weil, M Ryan
Ranganathan Ganakammal, Satishkumar
collection PubMed
description BACKGROUND: With cancer as one of the leading causes of death worldwide, accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. There have been efforts to categorize primary tumor types with gene expression data using machine learning, and more recently with deep learning, in the last several years. METHODS: In this paper, we developed four 1-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-seq count data as one of 17 highly represented primary tumor types or 32 primary tumor types regardless of imbalanced representation. Additionally, we adapted the models to take as input either all Ensembl genes (60,483) or protein coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. RNA-seq count data expressed as FPKM-UQ of 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA) were downloaded from the Genomic Data Commons (GDC) corresponding to 17 and 32 primary tumor types respectively for training and validating the models. RESULTS: All 4 1D-CNN models had an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that the models with protein coding genes only as features performed with better accuracy compared to the models with all Ensembl genes for both 17 and 32 primary tumor types. For all models, the accuracy by primary tumor type was above 80% for most primary tumor types. CONCLUSIONS: We packaged all 4 models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve the accuracy of certain primary tumor types.
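The METHODS summary above describes 1D convolutional networks that map a per-sample gene expression vector to one of several tumor type classes. The following is a minimal illustrative sketch of such a forward pass in plain Python, not the authors' architecture: the layer sizes, the log2 transform, the random weights, and every function name here are assumptions made for illustration only.

```python
import math
import random


def conv1d(x, kernels, stride=1):
    """Valid 1D convolution of vector x with each kernel, followed by ReLU."""
    width = len(kernels[0])
    channels = []
    for kernel in kernels:
        channel = []
        for start in range(0, len(x) - width + 1, stride):
            s = sum(x[start + j] * kernel[j] for j in range(width))
            channel.append(max(0.0, s))  # ReLU activation
        channels.append(channel)
    return channels


def max_pool1d(channels, size):
    """Non-overlapping max pooling applied to each channel."""
    return [
        [max(ch[i:i + size]) for i in range(0, len(ch) - size + 1, size)]
        for ch in channels
    ]


def dense_softmax(features, weights, biases):
    """Fully connected layer followed by a numerically stable softmax."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(expression, kernels, weights, biases, pool=2):
    """Return class probabilities for one expression vector."""
    # Assumption: a log2(x + 1) transform; the record does not state the scaling used.
    x = [math.log2(v + 1.0) for v in expression]
    pooled = max_pool1d(conv1d(x, kernels), pool)
    features = [v for ch in pooled for v in ch]  # flatten channels
    return dense_softmax(features, weights, biases)


# Toy dimensions (hypothetical; the real inputs have 19,758 or 60,483 genes
# and 17 or 32 output classes).
N_GENES, N_CLASSES, N_KERNELS, WIDTH, POOL = 12, 3, 2, 3, 2
rng = random.Random(0)
kernels = [[rng.uniform(-1, 1) for _ in range(WIDTH)] for _ in range(N_KERNELS)]
n_features = N_KERNELS * ((N_GENES - WIDTH + 1) // POOL)
weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(N_CLASSES)]
biases = [0.0] * N_CLASSES

sample = [rng.uniform(0, 100) for _ in range(N_GENES)]  # made-up expression values
probs = predict(sample, kernels, weights, biases)
predicted_class = max(range(N_CLASSES), key=lambda i: probs[i])
print(predicted_class, [round(p, 3) for p in probs])
```

With untrained random weights the predicted class is meaningless; the sketch only shows how an expression vector flows through convolution, pooling, and a softmax output layer.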
format Online
Article
Text
id pubmed-9729992
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher SAGE Publications
record_format MEDLINE/PubMed
spelling pubmed-9729992 2022-12-09 TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks. Cancer Inform. Original Research. SAGE Publications 2022-12-05 /pmc/articles/PMC9729992/ /pubmed/36507076 http://dx.doi.org/10.1177/11769351221139491 Text en © The Author(s) 2022. This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
title TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks
topic Original Research