A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generate...

Bibliographic Details
Main authors:	Wang, WeiBo, Sun, Wei, Wang, Wei, Szatkiewicz, Jin
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831535/ https://www.ncbi.nlm.nih.gov/pubmed/29490610 http://dx.doi.org/10.1186/s12859-018-2077-6

_version_	1783303160631132160
author	Wang, WeiBo Sun, Wei Wang, Wei Szatkiewicz, Jin
collection	PubMed
description	BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example. RESULTS: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method “R-GENSENG”. Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG’s accuracy in CNV detection. CONCLUSIONS: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, such as ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2077-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5831535
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5831535 2018-03-05 A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection Wang, WeiBo Sun, Wei Wang, Wei Szatkiewicz, Jin BMC Bioinformatics Methodology Article BioMed Central 2018-03-01 /pmc/articles/PMC5831535/ /pubmed/29490610 http://dx.doi.org/10.1186/s12859-018-2077-6 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831535/ https://www.ncbi.nlm.nih.gov/pubmed/29490610 http://dx.doi.org/10.1186/s12859-018-2077-6
