Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability

Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the unde...

Bibliographic Details
Main Authors:	Zheng, Hongyu; Ma, Cong; Kingsford, Carl
Format:	Online Article Text
Language:	English
Published:	Mary Ann Liebert, Inc., publishers 2022
Subjects:	Research Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8892959/ https://www.ncbi.nlm.nih.gov/pubmed/35041494 http://dx.doi.org/10.1089/cmb.2021.0444

author	Zheng, Hongyu Ma, Cong Kingsford, Carl
author_facet	Zheng, Hongyu Ma, Cong Kingsford, Carl
author_sort	Zheng, Hongyu
collection	PubMed
description	Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes, where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification that accounts for reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a “confidence range of expression” for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range indicates both whether a transcript has potential estimation error due to nonidentifiability and the extent of that error. Applying our methods to the Human Body Map data, we observe that 35%–50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy for 20%–47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that the majority of detected differentially expressed transcripts are reliable, with a few exceptions, after considering the ranges of the optimal expression estimates.
format	Online Article Text
id	pubmed-8892959
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Mary Ann Liebert, Inc., publishers
record_format	MEDLINE/PubMed
spelling	pubmed-8892959 2022-03-03 Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability Zheng, Hongyu Ma, Cong Kingsford, Carl J Comput Biol Research Articles
Mary Ann Liebert, Inc., publishers 2022-02-01 2022-02-16 /pmc/articles/PMC8892959/ /pubmed/35041494 http://dx.doi.org/10.1089/cmb.2021.0444 Text en © Hongyu Zheng, et al., 2022. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle	Research Articles Zheng, Hongyu Ma, Cong Kingsford, Carl Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
title	Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8892959/ https://www.ncbi.nlm.nih.gov/pubmed/35041494 http://dx.doi.org/10.1089/cmb.2021.0444
work_keys_str_mv	AT zhenghongyu derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability AT macong derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability AT kingsfordcarl derivingrangesofoptimalestimatedtranscriptexpressionduetononidentifiability
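
The description above says that several abundance estimates can explain the observed reads equally well, and that each transcript then has a range of equally optimal values. The sketch below illustrates that idea on a toy gene; it is not the authors' algorithm. It assumes a hypothetical gene with two alternative starts and two alternative ends, giving four isoforms T1 = (start 1, end 1), T2 = (start 1, end 2), T3 = (start 2, end 1), T4 = (start 2, end 2), and it assumes reads only reveal how much expression passes through each start and each end. The function name `optimal_ranges` and the example counts are made up for illustration.

```python
def optimal_ranges(s1, s2, e1, e2):
    """Range of each isoform's abundance over all equally good fits.

    Toy model (an assumption, not the paper's method): the data fix only
    the four sums
        T1 + T2 = s1,  T3 + T4 = s2,  T1 + T3 = e1,  T2 + T4 = e2,
    so any non-negative solution of these equations fits the data
    equally well. Choosing T1 determines the other three values.
    """
    if min(s1, s2, e1, e2) < 0:
        raise ValueError("feature totals must be non-negative")
    if s1 + s2 != e1 + e2:
        raise ValueError("start and end totals must agree")

    # T1 is bounded so that T2, T3 and T4 all stay non-negative.
    lo = max(0, e1 - s2)
    hi = min(s1, e1)

    return {
        "T1": (lo, hi),
        "T2": (s1 - hi, s1 - lo),            # T2 = s1 - T1
        "T3": (e1 - hi, e1 - lo),            # T3 = e1 - T1
        "T4": (s2 - e1 + lo, s2 - e1 + hi),  # T4 = s2 - e1 + T1
    }


# Hypothetical feature totals for one gene.
ranges = optimal_ranges(60, 40, 70, 30)
for name, (low, high) in ranges.items():
    status = "identifiable" if low == high else "nonidentifiable"
    print(f"{name}: [{low}, {high}] {status}")
```

With these made-up totals, T1 can take any value from 30 to 60 without changing the fit, so a single reported estimate for T1 would hide a wide range. When one start is unused (for example `optimal_ranges(50, 0, 50, 0)`), every range collapses to a single point and the abundances are identifiable.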

Similar Items