A (fire)cloud-based DNA methylation data preprocessing and quality control platform

BACKGROUND: Bisulfite sequencing allows base-pair resolution profiling of DNA methylation and has recently been adapted for use in single cells. Analyzing these data, including making comparisons with existing data, remains challenging due to the scale of the data and differences in preprocessing me...

Bibliographic Details
Main authors: Kangeyan, Divy, Dunford, Andrew, Iyer, Sowmya, Stewart, Chip, Hanna, Megan, Getz, Gad, Aryee, Martin J.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6440105/
https://www.ncbi.nlm.nih.gov/pubmed/30922215
http://dx.doi.org/10.1186/s12859-019-2750-4
author Kangeyan, Divy
Dunford, Andrew
Iyer, Sowmya
Stewart, Chip
Hanna, Megan
Getz, Gad
Aryee, Martin J.
collection PubMed
description BACKGROUND: Bisulfite sequencing allows base-pair resolution profiling of DNA methylation and has recently been adapted for use in single cells. Analyzing these data, including making comparisons with existing data, remains challenging due to the scale of the data and differences in preprocessing methods between published datasets. RESULTS: We present a set of preprocessing pipelines for bisulfite sequencing DNA methylation data that include a new R/Bioconductor package, scmeth, for a series of efficient QC analyses of large datasets. The pipelines go from raw data to CpG-level methylation estimates and can be run, with identical results, on a single computer, on an HPC cluster, or on Google Cloud Compute resources. These pipelines are designed to allow users to 1) ensure reproducibility of analyses, 2) achieve scalability to large whole genome datasets with 100 GB+ of raw data per sample and to single-cell datasets with thousands of cells, 3) enable integration and comparison between user-provided data and publicly available data, as all samples can be processed through the same pipeline, and 4) access best-practice analysis pipelines. Pipelines are provided for whole genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS) and hybrid selection (capture) bisulfite sequencing (HSBS). CONCLUSIONS: The workflows produce data quality metrics, visualization tracks, and aggregated output for further downstream analysis. Optional use of cloud computing resources facilitates analysis of large datasets and integration with existing methylome profiles. The workflow design principles are applicable to other genomic data types.
format Online
Article
Text
id pubmed-6440105
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6440105 2019-04-11 A (fire)cloud-based DNA methylation data preprocessing and quality control platform Kangeyan, Divy Dunford, Andrew Iyer, Sowmya Stewart, Chip Hanna, Megan Getz, Gad Aryee, Martin J. BMC Bioinformatics Software
BioMed Central 2019-03-29 /pmc/articles/PMC6440105/ /pubmed/30922215 http://dx.doi.org/10.1186/s12859-019-2750-4 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A (fire)cloud-based DNA methylation data preprocessing and quality control platform
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6440105/
https://www.ncbi.nlm.nih.gov/pubmed/30922215
http://dx.doi.org/10.1186/s12859-019-2750-4