
Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?

RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels....


Bibliographic Details
Main Authors: Johnson, Nathan T., Dhroso, Andi, Hughes, Katelyn J., Korkin, Dmitry
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097660/
https://www.ncbi.nlm.nih.gov/pubmed/29941426
http://dx.doi.org/10.1261/rna.062802.117
_version_ 1783348340423917568
author Johnson, Nathan T.
Dhroso, Andi
Hughes, Katelyn J.
Korkin, Dmitry
author_facet Johnson, Nathan T.
Dhroso, Andi
Hughes, Katelyn J.
Korkin, Dmitry
author_sort Johnson, Nathan T.
collection PubMed
description RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels. However, leveraging the RNA-seq data requires development of new data mining and analytics methods. Supervised learning methods are commonly used approaches for biological data analysis that have recently gained attention for their applications to RNA-seq data. Here, we assess the utility of supervised learning methods trained on RNA-seq data for a diverse range of biological classification tasks. We hypothesize that the transcript-level expression data are more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment utilizes multiple data sets, organisms, lab groups, and RNA-seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-seq data sets and include over 2000 samples that come from multiple organisms, lab groups, and RNA-seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and pathological tumor stages for the samples from the cancerous tissue. For each problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the transcript-based classifiers outperform or are comparable with gene expression-based methods. The top-performing techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-seq based data analysis.
format Online
Article
Text
id pubmed-6097660
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-6097660 2019-09-01 Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers? Johnson, Nathan T. Dhroso, Andi Hughes, Katelyn J. Korkin, Dmitry RNA Bioinformatics RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels. However, leveraging the RNA-seq data requires development of new data mining and analytics methods. Supervised learning methods are commonly used approaches for biological data analysis that have recently gained attention for their applications to RNA-seq data. Here, we assess the utility of supervised learning methods trained on RNA-seq data for a diverse range of biological classification tasks. We hypothesize that the transcript-level expression data are more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment utilizes multiple data sets, organisms, lab groups, and RNA-seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-seq data sets and include over 2000 samples that come from multiple organisms, lab groups, and RNA-seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and pathological tumor stages for the samples from the cancerous tissue. For each problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the transcript-based classifiers outperform or are comparable with gene expression-based methods.
The top-performing techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-seq based data analysis. Cold Spring Harbor Laboratory Press 2018-09 /pmc/articles/PMC6097660/ /pubmed/29941426 http://dx.doi.org/10.1261/rna.062802.117 Text en © 2018 Johnson et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by the RNA Society for the first 12 months after the full-issue publication date (see http://rnajournal.cshlp.org/site/misc/terms.xhtml). After 12 months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Bioinformatics
Johnson, Nathan T.
Dhroso, Andi
Hughes, Katelyn J.
Korkin, Dmitry
Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title_full Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title_fullStr Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title_full_unstemmed Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title_short Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
title_sort biological classification with rna-seq data: can alternatively spliced transcript expression enhance machine learning classifiers?
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097660/
https://www.ncbi.nlm.nih.gov/pubmed/29941426
http://dx.doi.org/10.1261/rna.062802.117