Finite-size effects in transcript sequencing count distribution: its power-law correction necessarily precedes downstream normalization and comparative analysis

BACKGROUND: Though earlier work on modelling transcript abundance from vertebrates to lower eukaryotes has specifically singled out Zipf’s law, the observed distributions often deviate from a single power-law slope. In hindsight, while power-laws of critical phenomena are derived asymptotically...

Bibliographic Details
Main Authors:	Wong, Wing-Cheong; Ng, Hong-kiat; Tantoso, Erwin; Soong, Richie; Eisenhaber, Frank
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5809866/ https://www.ncbi.nlm.nih.gov/pubmed/29433547 http://dx.doi.org/10.1186/s13062-018-0204-y

_version_	1783299630878949376
author	Wong, Wing-Cheong; Ng, Hong-kiat; Tantoso, Erwin; Soong, Richie; Eisenhaber, Frank
author_sort	Wong, Wing-Cheong
collection	PubMed
description	BACKGROUND: Though earlier work on modelling transcript abundance from vertebrates to lower eukaryotes has specifically singled out Zipf’s law, the observed distributions often deviate from a single power-law slope. In hindsight, while power-laws of critical phenomena are derived asymptotically under the condition of infinite observations, real-world observations are finite, so finite-size effects set in and force a power-law distribution into an exponential decay, which manifests as a curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power-law on sequencing count data has never truly been examined and quantified. RESULTS: The anecdotal description of transcript abundance as almost Zipf’s law-like distributed can be conceptualized as the imperfect mathematical rendition of the Pareto power-law distribution when subjected to finite-size effects in the real world; this holds regardless of advances in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern-day NGS (next-generation sequencing) datasets: an in-house generated dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spike-in miRNA dataset. Firstly, the finite-size effects cause the deviations of sequencing count data from Zipf’s law and issues of reproducibility in sequencing experiments. Secondly, they manifest as heteroskedasticity among experimental replicates, which undermines statistical analysis. Surprisingly, a straightforward power-law correction that restores the distorted distribution to a single exponent value can dramatically reduce data heteroskedasticity, yielding an immediate increase in signal-to-noise ratio of 50% and in statistical/detection sensitivity of as much as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves concordance in significant calls among different normalization methods of a data series by 22% on average. When presented with a higher sequencing depth (a fourfold difference), the improvement in concordance is asymmetrical (32% for the higher sequencing depth instance versus 13% for the lower instance), which demonstrates that the simple power-law correction can increase the detection of significant calls at higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and elucidates the metastasis potential of the NUGC3 cell line against AGS in our dilution analysis. CONCLUSIONS: The finite-size effects due to undersampling generally plague transcript count data with reproducibility issues but can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of the study and the rigor of the scientific findings. REVIEWERS: This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13062-018-0204-y) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5809866
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5809866 2018-02-16 Biol Direct Research BioMed Central 2018-02-12 /pmc/articles/PMC5809866/ /pubmed/29433547 http://dx.doi.org/10.1186/s13062-018-0204-y Text en © The Author(s) 2018. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Finite-size effects in transcript sequencing count distribution: its power-law correction necessarily precedes downstream normalization and comparative analysis
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5809866/ https://www.ncbi.nlm.nih.gov/pubmed/29433547 http://dx.doi.org/10.1186/s13062-018-0204-y