
qc3C: Reference-free quality control for Hi-C sequencing data

Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies...


Bibliographic Details
Main authors: DeMaere, Matthew Z., Darling, Aaron E.
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8530316/
https://www.ncbi.nlm.nih.gov/pubmed/34634030
http://dx.doi.org/10.1371/journal.pcbi.1008839
author DeMaere, Matthew Z.
Darling, Aaron E.
collection PubMed
description Hi-C is a sample preparation method that enables high-throughput sequencing to capture genome-wide spatial interactions between DNA molecules. The technique has been successfully applied to solve challenging problems such as 3D structural analysis of chromatin, scaffolding of large genome assemblies and more recently the accurate resolution of metagenome-assembled genomes (MAGs). Despite continued refinements, however, preparing a Hi-C library remains a complex laboratory protocol. To avoid costly failures and maximise the odds of successful outcomes, diligent quality management is recommended. Current wet-lab methods provide only a crude assay of Hi-C library quality, while key post-sequencing quality indicators used have—thus far—relied upon reference-based read-mapping. When a reference is accessible, this reliance introduces a concern for quality, where an incomplete or inexact reference skews the resulting quality indicators. We propose a new, reference-free approach that infers the total fraction of read-pairs that are a product of proximity ligation. This quantification of Hi-C library quality requires only a modest amount of sequencing data and is independent of other application-specific criteria. The algorithm builds upon the observation that proximity ligation events are likely to create k-mers that would not naturally occur in the sample. Our software tool (qc3C) is to our knowledge the first to implement a reference-free Hi-C QC tool, and also provides reference-based QC, enabling Hi-C to be more easily applied to non-model organisms and environmental samples. We characterise the accuracy of the new algorithm on simulated and real datasets and compare it to reference-based methods.
format Online
Article
Text
id pubmed-8530316
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8530316 2021-10-22 qc3C: Reference-free quality control for Hi-C sequencing data DeMaere, Matthew Z. Darling, Aaron E. PLoS Comput Biol Research Article
Public Library of Science 2021-10-11 /pmc/articles/PMC8530316/ /pubmed/34634030 http://dx.doi.org/10.1371/journal.pcbi.1008839 Text en © 2021 DeMaere, Darling https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title qc3C: Reference-free quality control for Hi-C sequencing data
topic Research Article