
A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation

One of the main benefits of using modern RNA-Sequencing (RNA-Seq) technology is the more accurate gene expression estimations compared with previous generations of expression data, such as the microarray. However, numerous issues can result in the possibility that an RNA-Seq read can be mapped to mu...


Bibliographic Details
Main authors: McDermaid, Adam, Chen, Xin, Zhang, Yiran, Wang, Cankun, Gu, Shaopeng, Xie, Juan, Ma, Qin
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6102479/
https://www.ncbi.nlm.nih.gov/pubmed/30154828
http://dx.doi.org/10.3389/fgene.2018.00313
author McDermaid, Adam
Chen, Xin
Zhang, Yiran
Wang, Cankun
Gu, Shaopeng
Xie, Juan
Ma, Qin
collection PubMed
description One of the main benefits of modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation compared with previous generations of expression data, such as the microarray. However, numerous issues can cause an RNA-Seq read to be mapped to multiple locations on the reference genome with the same alignment scores, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). The impact of these MMRs is reflected in gene expression estimation and all downstream analyses, including differential gene expression, functional enrichment, etc. Current analysis pipelines lack the tools to effectively test the reliability of gene expression estimations, and are thus incapable of ensuring the validity of all downstream analyses. Our investigation into 95 RNA-Seq datasets from seven plant and animal species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs. Here we present a machine learning-based tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene's expression level derived from an RNA-Seq dataset. The underlying algorithm is based on extracted genomic and transcriptomic features, which are then combined using elastic-net regularization and mixture model fitting to provide a clearer picture of mapping uncertainty for each gene. GeneQC allows researchers to determine reliable expression estimations and conduct further analysis on the gene expression that is of sufficient quality. This tool also enables researchers to investigate continued re-alignment methods to determine more accurate gene expression estimates for those with low reliability. Application of GeneQC reveals a high level of mapping uncertainty in plant samples and limited, severe mapping uncertainty in animal samples. GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.
format Online
Article
Text
id pubmed-6102479
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6102479 2018-08-28 A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation McDermaid, Adam Chen, Xin Zhang, Yiran Wang, Cankun Gu, Shaopeng Xie, Juan Ma, Qin Front Genet Genetics Frontiers Media S.A. 2018-08-14 /pmc/articles/PMC6102479/ /pubmed/30154828 http://dx.doi.org/10.3389/fgene.2018.00313 Text en Copyright © 2018 McDermaid, Chen, Zhang, Wang, Gu, Xie and Ma. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation
topic Genetics
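The abstract defines a multiple-mapping read (MMR) as a read that maps to more than one location on the reference genome with the same alignment score, and reports that roughly 22% of reads are MMRs on average. The following is a minimal sketch of that definition only, not the GeneQC algorithm: it flags reads whose best alignment score is tied across several locations and computes the MMR fraction. The input layout (tuples of read ID, location, and score), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def find_multi_mapping_reads(alignments):
    """Return IDs of reads whose best alignment score is tied across >1 location.

    `alignments` is a hypothetical iterable of (read_id, location, score) tuples.
    """
    hits_by_read = defaultdict(list)
    for read_id, location, score in alignments:
        hits_by_read[read_id].append((location, score))

    mmrs = set()
    for read_id, hits in hits_by_read.items():
        best_score = max(score for _, score in hits)
        # Distinct locations that share the best score.
        best_locations = {loc for loc, score in hits if score == best_score}
        if len(best_locations) > 1:
            mmrs.add(read_id)
    return mmrs


def mmr_fraction(alignments):
    """Fraction of distinct reads that are multiple-mapping reads."""
    alignments = list(alignments)
    reads = {read_id for read_id, _, _ in alignments}
    if not reads:
        return 0.0
    return len(find_multi_mapping_reads(alignments)) / len(reads)


# Toy data (invented for illustration): r2 ties at two loci, r3 has a clear best hit.
alignments = [
    ("r1", "chr1:100", 60),
    ("r2", "chr1:500", 60),
    ("r2", "chr3:900", 60),
    ("r3", "chr2:40", 60),
    ("r3", "chr5:77", 42),
    ("r4", "chr4:10", 55),
]
example_mmrs = find_multi_mapping_reads(alignments)
example_fraction = mmr_fraction(alignments)
```

In this toy input only `r2` qualifies, because `r3` has a second hit with a strictly lower score. A real pipeline would read these records from an aligner's output rather than from an in-memory list.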