Benchmarking of 4C-seq pipelines based on real and simulated data

Bibliographic Details
Main authors: Walter, Carolin; Schuetzmann, Daniel; Rosenbauer, Frank; Dugas, Martin
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6901067/
https://www.ncbi.nlm.nih.gov/pubmed/31134276
http://dx.doi.org/10.1093/bioinformatics/btz426
_version_ 1783477447981793280
author Walter, Carolin
Schuetzmann, Daniel
Rosenbauer, Frank
Dugas, Martin
author_facet Walter, Carolin
Schuetzmann, Daniel
Rosenbauer, Frank
Dugas, Martin
author_sort Walter, Carolin
collection PubMed
description MOTIVATION: With its capacity for high-resolution data output in one region of interest, chromosome conformation capture combined with high-throughput sequencing (4C-seq) is a state-of-the-art next-generation sequencing technique that provides epigenetic insights, and regularly advances current medical research. However, 4C-seq data are complex and prone to biases, and while specialized programs exist, an unbiased, extensive benchmarking is still lacking. Furthermore, neither substantial datasets with fully characterized ground truth, nor simulation programs for realistic 4C-seq data have been published. RESULTS: We conducted a benchmarking study on 66 4C-seq samples from 20 datasets, and developed a novel 4C-seq simulation software, Basic4CSim, to allow for detailed comparisons of 4C-seq algorithms on 50 simulated datasets with 10–120 samples each. Simulations and benchmarking were adapted to address different characteristics of 4C-seq data. Simulated data were compared with published samples to validate simulation settings. We identified differences between 4C-seq algorithms in terms of precision, recall, interaction structure, and run time, and observed general trends. Novel differential pipeline versions of single-sample based 4C-seq algorithms were included in the benchmarking. While no single tool was optimally suited for both near-cis and far-cis, and both single-sample and differential analyses, choosing a high-performing algorithm variant did improve results considerably. For near-cis scenarios, r3Cseq, peakC and FourCSeq offered high precision, while fourSig demonstrated high overall F1 scores in far-cis analyses. Finally, 4C-seq simulations may aid in the development of improved analysis algorithms. AVAILABILITY AND IMPLEMENTATION: Basic4CSim is available at https://github.com/walter-ca/Basic4CSim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6901067
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6901067 2019-12-16 Benchmarking of 4C-seq pipelines based on real and simulated data Walter, Carolin Schuetzmann, Daniel Rosenbauer, Frank Dugas, Martin Bioinformatics Original Papers MOTIVATION: With its capacity for high-resolution data output in one region of interest, chromosome conformation capture combined with high-throughput sequencing (4C-seq) is a state-of-the-art next-generation sequencing technique that provides epigenetic insights, and regularly advances current medical research. However, 4C-seq data are complex and prone to biases, and while specialized programs exist, an unbiased, extensive benchmarking is still lacking. Furthermore, neither substantial datasets with fully characterized ground truth, nor simulation programs for realistic 4C-seq data have been published. RESULTS: We conducted a benchmarking study on 66 4C-seq samples from 20 datasets, and developed a novel 4C-seq simulation software, Basic4CSim, to allow for detailed comparisons of 4C-seq algorithms on 50 simulated datasets with 10–120 samples each. Simulations and benchmarking were adapted to address different characteristics of 4C-seq data. Simulated data were compared with published samples to validate simulation settings. We identified differences between 4C-seq algorithms in terms of precision, recall, interaction structure, and run time, and observed general trends. Novel differential pipeline versions of single-sample based 4C-seq algorithms were included in the benchmarking. While no single tool was optimally suited for both near-cis and far-cis, and both single-sample and differential analyses, choosing a high-performing algorithm variant did improve results considerably. For near-cis scenarios, r3Cseq, peakC and FourCSeq offered high precision, while fourSig demonstrated high overall F1 scores in far-cis analyses. Finally, 4C-seq simulations may aid in the development of improved analysis algorithms.
AVAILABILITY AND IMPLEMENTATION: Basic4CSim is available at https://github.com/walter-ca/Basic4CSim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2019-12-01 2019-05-27 /pmc/articles/PMC6901067/ /pubmed/31134276 http://dx.doi.org/10.1093/bioinformatics/btz426 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Original Papers
Walter, Carolin
Schuetzmann, Daniel
Rosenbauer, Frank
Dugas, Martin
Benchmarking of 4C-seq pipelines based on real and simulated data
title Benchmarking of 4C-seq pipelines based on real and simulated data
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6901067/
https://www.ncbi.nlm.nih.gov/pubmed/31134276
http://dx.doi.org/10.1093/bioinformatics/btz426