Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability

Bibliographic Details
Main Authors: Zheng, Hongyu; Ma, Cong; Kingsford, Carl
Format: Online Article Text
Language: English
Published: Mary Ann Liebert, Inc., publishers 2022
Subjects: Research Articles
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8892959/
https://www.ncbi.nlm.nih.gov/pubmed/35041494
http://dx.doi.org/10.1089/cmb.2021.0444
_version_ 1784662285126467584
author Zheng, Hongyu
Ma, Cong
Kingsford, Carl
author_facet Zheng, Hongyu
Ma, Cong
Kingsford, Carl
author_sort Zheng, Hongyu
collection PubMed
description Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes, where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification that accounts for reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a “confidence range of expression” for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range informs both whether a transcript has potential estimation error due to nonidentifiability and the extent of the error. Applying our methods to the Human Body Map data, we observe that 35%–50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy of 20%–47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that, after considering the ranges of the optimal expression estimates, the majority of detected differentially expressed transcripts are reliable, with a few exceptions.
format Online
Article
Text
id pubmed-8892959
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Mary Ann Liebert, Inc., publishers
record_format MEDLINE/PubMed
spelling pubmed-8892959 2022-03-03 Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability Zheng, Hongyu Ma, Cong Kingsford, Carl J Comput Biol Research Articles
Mary Ann Liebert, Inc., publishers 2022-02-01 2022-02-16 /pmc/articles/PMC8892959/ /pubmed/35041494 http://dx.doi.org/10.1089/cmb.2021.0444 Text en © Hongyu Zheng, et al., 2022. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle Research Articles
Zheng, Hongyu
Ma, Cong
Kingsford, Carl
Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title_full Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title_fullStr Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title_full_unstemmed Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title_short Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title_sort deriving ranges of optimal estimated transcript expression due to nonidentifiability
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8892959/
https://www.ncbi.nlm.nih.gov/pubmed/35041494
http://dx.doi.org/10.1089/cmb.2021.0444
work_keys_str_mv AT zhenghongyu derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability
AT macong derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability
AT kingsfordcarl derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability
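
The nonidentifiability described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' method. It assumes a hypothetical two-exon gene with three transcripts (T1 covers exons A and B, T2 covers only A, T3 covers only B) and a simplified model in which the likelihood depends only on the expected coverage of each exon. All names (`COVERS`, `expected_coverage`, `abundance_ranges`) and all numbers are made up for illustration. Under these assumptions, any abundance vector with the same per-exon coverage is equally optimal, and the range of each transcript's abundance across those vectors can be read off directly.

```python
from fractions import Fraction

# Hypothetical toy gene: rows are exons A and B, columns are transcripts
# T1 (A+B), T2 (A only), T3 (B only). A 1 means the transcript covers the exon.
COVERS = [
    [1, 1, 0],  # exon A
    [1, 0, 1],  # exon B
]


def expected_coverage(abundances):
    """Per-exon coverage implied by a vector of transcript abundances."""
    return [sum(c * x for c, x in zip(row, abundances)) for row in COVERS]


def abundance_ranges(x0, d):
    """Range of each coordinate of x0 + lam * d over every lam that keeps
    all coordinates non-negative.

    Assumes x0 is non-negative, d leaves expected_coverage unchanged, and d
    has at least one positive and one negative entry (so lam is bounded).
    """
    # Coordinates with d_i > 0 bound lam from below, those with d_i < 0 from above.
    lam_lo = max(Fraction(-x, di) for x, di in zip(x0, d) if di > 0)
    lam_hi = min(Fraction(x, -di) for x, di in zip(x0, d) if di < 0)
    ranges = []
    for x, di in zip(x0, d):
        ends = sorted((x + lam_lo * di, x + lam_hi * di))
        ranges.append((ends[0], ends[1]))
    return ranges


# One optimal estimate (assumed) and the direction along which abundance can
# be shifted between T1 and the pair (T2, T3) without changing exon coverage.
X0 = [5, 5, 5]
D = [1, -1, -1]
RANGES = abundance_ranges(X0, D)
```

In this toy setting every transcript's range spans from zero up to the full exon coverage, so neither the abundances nor the ranking among the three isoforms can be determined from the coverage alone. Real quantification models have many more transcripts and a higher-dimensional set of optimal solutions, where a closed-form scan along a single direction like this one would no longer be enough.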