Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis

BACKGROUND: The current versions of reference genome assemblies still contain gaps represented by stretches of Ns. Since high-throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay a targeted port...

Bibliographic Details
Main Authors: Domanska, Diana; Kanduri, Chakravarthi; Simovski, Boris; Sandve, Geir Kjetil
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6293655/
https://www.ncbi.nlm.nih.gov/pubmed/30547739
http://dx.doi.org/10.1186/s12859-018-2438-1
author Domanska, Diana
Kanduri, Chakravarthi
Simovski, Boris
Sandve, Geir Kjetil
collection PubMed
description BACKGROUND: The current versions of reference genome assemblies still contain gaps represented by stretches of Ns. Since high-throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay only a targeted portion of the genomic sequence, meaning that regions from the unassayed portion cannot be detected in those experiments. We here refer to all such regions as inaccessible regions, and hypothesize that ignoring these regions in the null model may increase false findings in statistical testing of colocalization of genomic features. RESULTS: Our exploratory analyses confirm that the genomic regions in public genomic tracks intersect very little with assembly gaps of human reference genomes (hg19 and hg38). What little intersection there was occurred only at the beginning and end portions of the gap regions. Further, we simulated a set of synthetic tracks by matching the properties of real genomic tracks in a way that nullified any true association between them. This allowed us to test our hypothesis that not avoiding inaccessible regions (as represented by assembly gaps) in the null model would result in spurious inflation of statistical significance. We contrasted the distributions of test statistics and p-values of Monte Carlo-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing colocalization between a pair of tracks. We observed that the statistical tests that did not account for assembly gaps in the null model resulted in a distribution of the test statistic that is shifted to the right and a distribution of p-values that is shifted to the left (indicating inflated significance). We observed a similar level of inflated significance in hg19 and hg38, despite assembly gaps covering a smaller proportion of the latter reference genome.
CONCLUSION: We provide empirical evidence demonstrating that inaccessible regions, even when covering only a few percent of the genome, can lead to a substantial number of false findings if not accounted for in statistical colocalization analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2438-1) contains supplementary material, which is available to authorized users.
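The contrast described in the abstract, a Monte Carlo permutation test whose null model either avoids or ignores assembly gaps, can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the genome length, the single gap, the track sizes, the point-overlap test statistic, and all function names are assumptions chosen for illustration, and the gap fraction (30%) is deliberately exaggerated relative to real assemblies so that the effect shows up with few replicates. Both tracks are drawn independently from accessible positions only, so any apparent significance is spurious.

```python
import random


def accessible_positions(genome_len, gaps):
    """Positions not covered by any gap; gaps are half-open (start, end) intervals."""
    blocked = set()
    for start, end in gaps:
        blocked.update(range(start, end))
    return [pos for pos in range(genome_len) if pos not in blocked]


def overlap_count(points, covered):
    """Test statistic: number of points that fall on covered positions."""
    return sum(1 for pos in points if pos in covered)


def mc_p_value(covered_a, points_b, universe, n_perm, rng):
    """Monte Carlo p-value: re-place track B uniformly within `universe`."""
    observed = overlap_count(points_b, covered_a)
    at_least = 0
    for _ in range(n_perm):
        shuffled = rng.sample(universe, len(points_b))
        if overlap_count(shuffled, covered_a) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_perm + 1)


def run_replicate(rng, genome_len=10_000, gaps=((0, 3_000),),
                  n_covered_a=2_000, n_points_b=300, n_perm=200):
    """Simulate two unassociated tracks; return (p_gap_aware, p_gap_ignoring)."""
    accessible = accessible_positions(genome_len, gaps)
    whole_genome = list(range(genome_len))
    # Both tracks are drawn independently, and only from accessible positions,
    # so there is no true association between them.
    covered_a = set(rng.sample(accessible, n_covered_a))
    points_b = rng.sample(accessible, n_points_b)
    p_aware = mc_p_value(covered_a, points_b, accessible, n_perm, rng)
    p_ignoring = mc_p_value(covered_a, points_b, whole_genome, n_perm, rng)
    return p_aware, p_ignoring


if __name__ == "__main__":
    rng = random.Random(0)
    results = [run_replicate(rng) for _ in range(20)]
    mean_aware = sum(p for p, _ in results) / len(results)
    mean_ignoring = sum(p for _, p in results) / len(results)
    print(f"mean p-value, gaps avoided in null: {mean_aware:.3f}")
    print(f"mean p-value, gaps ignored in null: {mean_ignoring:.3f}")
```

In this toy setting the gap-ignoring null spreads the shuffled points over positions where track A can never occur, which lowers the expected null overlap and pushes p-values toward zero even though the tracks are unrelated. A realistic analysis would shuffle intervals rather than points and preserve segment lengths.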
format Online
Article
Text
id pubmed-6293655
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6293655 2018-12-18 Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis Domanska, Diana Kanduri, Chakravarthi Simovski, Boris Sandve, Geir Kjetil BMC Bioinformatics Research Article BioMed Central 2018-12-14 /pmc/articles/PMC6293655/ /pubmed/30547739 http://dx.doi.org/10.1186/s12859-018-2438-1 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis
topic Research Article