
RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes


Bibliographic Details
Main Authors: Wang, Likai; Xi, Yanpeng; Sung, Sibum; Qiao, Hong
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053725/
https://www.ncbi.nlm.nih.gov/pubmed/30029596
http://dx.doi.org/10.1186/s12864-018-4932-2
author Wang, Likai
Xi, Yanpeng
Sung, Sibum
Qiao, Hong
collection PubMed
description BACKGROUND: Although quality controls are applied at different stages of sample preparation and data analysis to ensure both the reproducibility and the reliability of RNA-seq results, there are still limitations and biases in the detectability of certain differentially expressed genes (DEGs). Whether the transcriptional dynamics of a gene can be captured accurately depends on the experimental design and operation and on the subsequent data analysis. The data processing workflow, including read alignment, transcript quantification, normalization, and the statistical methods used for the final identification of DEGs, can influence the accuracy and sensitivity of DEG analysis, producing a certain number of false positives or false negatives. Machine learning (ML) is a multidisciplinary field that draws on computer science, artificial intelligence, computational statistics and information theory to construct algorithms that can learn from existing data sets and make predictions on new data sets. ML-based differential network analysis has been applied to predict stress-responsive genes by learning the patterns of 32 expression characteristics of known stress-related genes. In addition, epigenetic regulation plays critical roles in gene expression; accordingly, DNA and histone methylation data have been shown to be powerful inputs for ML-based models that predict gene expression in many systems, including lung cancer cells. ML-based methods are therefore a promising way to identify DEGs that are not identified by the traditional RNA-seq method. RESULTS: We identified the top 23 most informative features by assessing the performance of three different feature selection algorithms combined with five different classification methods on training and testing data sets. By comprehensive comparison, we found that the model based on InfoGain feature selection and Logistic Regression classification is powerful for DEG prediction. Moreover, the power and performance of ML-based prediction were validated by prediction of ethylene-regulated gene expression followed by qRT-PCR. CONCLUSIONS: Our study shows that combining an ML-based method with RNA-seq greatly improves the sensitivity of DEG identification. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-4932-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6053725
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6053725 2018-07-23
RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes
Wang, Likai Xi, Yanpeng Sung, Sibum Qiao, Hong
BMC Genomics
Methodology Article
BioMed Central 2018-07-20
/pmc/articles/PMC6053725/ /pubmed/30029596 http://dx.doi.org/10.1186/s12864-018-4932-2
Text en
© The Author(s). 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053725/
https://www.ncbi.nlm.nih.gov/pubmed/30029596
http://dx.doi.org/10.1186/s12864-018-4932-2
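
The RESULTS part of the description names InfoGain feature selection as the ranking step that precedes classification. As a rough illustration only, the sketch below computes information gain, IG(Y; X) = H(Y) - H(Y | X), for discretized features and keeps the highest-ranked ones. It is not the authors' implementation: the feature names (mark_a, mark_b, mark_c), the toy rows, the labels, and the cutoff of two features are all hypothetical, and the classification step (Logistic Regression in the record) is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature X."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        gains.append((name, information_gain(column, labels)))
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: one row per gene, one binarized feature per column.
names = ["mark_a", "mark_b", "mark_c"]
rows = [
    (1, 0, 1),
    (1, 1, 1),
    (1, 0, 0),
    (0, 1, 1),
    (0, 0, 1),
    (0, 1, 0),
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = DEG, 0 = non-DEG (made-up labels)

ranking = rank_features(rows, labels, names)
top_k = [name for name, _ in ranking[:2]]  # keep the 2 best features (arbitrary cutoff)
```

In this toy set, mark_a separates the two classes perfectly and so ranks first, while mark_c carries no information about the label. Continuous features would need to be discretized before this calculation, which is an assumption of the sketch rather than something the record specifies.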