Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?
RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels....
Main authors: | Johnson, Nathan T.; Dhroso, Andi; Hughes, Katelyn J.; Korkin, Dmitry |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2018 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097660/ https://www.ncbi.nlm.nih.gov/pubmed/29941426 http://dx.doi.org/10.1261/rna.062802.117 |
_version_ | 1783348340423917568 |
---|---|
author | Johnson, Nathan T. Dhroso, Andi Hughes, Katelyn J. Korkin, Dmitry |
author_sort | Johnson, Nathan T. |
collection | PubMed |
description | RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels. However, leveraging the RNA-seq data requires development of new data mining and analytics methods. Supervised learning methods are commonly used approaches for biological data analysis that have recently gained attention for their applications to RNA-seq data. Here, we assess the utility of supervised learning methods trained on RNA-seq data for a diverse range of biological classification tasks. We hypothesize that the transcript-level expression data are more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment utilizes multiple data sets, organisms, lab groups, and RNA-seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-seq data sets and include over 2000 samples that come from multiple organisms, lab groups, and RNA-seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and pathological tumor stages for the samples from the cancerous tissue. For each problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the transcript-based classifiers outperform or are comparable with gene expression-based methods. The top-performing techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-seq based data analysis. |
format | Online Article Text |
id | pubmed-6097660 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6097660 2019-09-01 Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers? Johnson, Nathan T. Dhroso, Andi Hughes, Katelyn J. Korkin, Dmitry RNA Bioinformatics Cold Spring Harbor Laboratory Press 2018-09 /pmc/articles/PMC6097660/ /pubmed/29941426 http://dx.doi.org/10.1261/rna.062802.117 Text en © 2018 Johnson et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by the RNA Society for the first 12 months after the full-issue publication date (see http://rnajournal.cshlp.org/site/misc/terms.xhtml). After 12 months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
title | Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers? |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097660/ https://www.ncbi.nlm.nih.gov/pubmed/29941426 http://dx.doi.org/10.1261/rna.062802.117 |
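
The description above hypothesizes that transcript-level expression is more informative for classification than gene-level expression. The following is a minimal sketch of that idea only, not the authors' pipeline: the transcript IDs, gene names, expression values, and tissue labels are all invented toy data, and the nearest-centroid classifier with leave-one-out evaluation is a stand-in chosen for brevity (the record does not name the six classifiers or three normalization techniques used in the study).

```python
from collections import defaultdict

# Hypothetical mapping and toy expression values, invented for illustration.
# tx1 and tx2 are two isoforms of geneA; tx3 is the only isoform of geneB.
TRANSCRIPT_TO_GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

SAMPLES = [
    ({"tx1": 9.0, "tx2": 1.0, "tx3": 5.0}, "tissue_1"),
    ({"tx1": 8.0, "tx2": 2.0, "tx3": 5.5}, "tissue_1"),
    ({"tx1": 1.0, "tx2": 9.0, "tx3": 5.0}, "tissue_2"),
    ({"tx1": 2.0, "tx2": 8.0, "tx3": 4.5}, "tissue_2"),
]


def to_gene_level(tx_expr, tx_to_gene=TRANSCRIPT_TO_GENE):
    """Collapse transcript expression into per-gene totals by summing isoforms."""
    gene_expr = defaultdict(float)
    for tx, value in tx_expr.items():
        gene_expr[tx_to_gene[tx]] += value
    return dict(gene_expr)


def as_vector(expr):
    """Turn a {feature: value} dict into a list with a stable feature order."""
    return [expr[name] for name in sorted(expr)]


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train, query):
    """Nearest-centroid prediction: return the label with the closest class mean."""
    by_label = defaultdict(list)
    for vector, label in train:
        by_label[label].append(vector)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return min(centroids, key=lambda label: squared_distance(centroids[label], query))


def leave_one_out_accuracy(data):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i, (vector, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += predict(train, vector) == label
    return correct / len(data)


# Build both feature representations from the same toy samples.
transcript_data = [(as_vector(expr), label) for expr, label in SAMPLES]
gene_data = [(as_vector(to_gene_level(expr)), label) for expr, label in SAMPLES]

if __name__ == "__main__":
    print("transcript-level accuracy:", leave_one_out_accuracy(transcript_data))
    print("gene-level accuracy:", leave_one_out_accuracy(gene_data))
```

In this toy setup the two groups differ only in which geneA isoform dominates, so summing isoforms gives every sample the same geneA total and the gene-level features lose that signal. Real RNA-seq data would need normalization and a proper train/test design, which this sketch omits.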