
Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence...


Bibliographic Details
Main authors: Han, Seong Kyu; Muto, Yoshiharu; Wilson, Parker C.; Humphreys, Benjamin D.; Sampson, Matthew G.; Chakravarti, Aravinda; Lee, Dongwon
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2022
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907136/
https://www.ncbi.nlm.nih.gov/pubmed/36508674
http://dx.doi.org/10.1073/pnas.2212810119
collection PubMed
description Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify “high-quality” (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
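The idea in the description, scoring a sample by how well a sequence-based model trained on its peaks predicts held-out peaks, can be illustrated with a toy sketch. This is not the gkmQC implementation: it assumes plain 3-mer frequencies and a simple mean-difference linear scorer in place of a gapped k-mer SVM, uses AUC as the accuracy measure, and runs on synthetic sequences built around a hypothetical motif. All function names are illustrative.

```python
import random

MOTIF = "GATAAG"  # hypothetical motif, used only to build toy "peak" sequences


def kmer_freqs(seq, k=3):
    """Return the normalized k-mer frequencies of one sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    total = sum(counts.values()) or 1
    return {kmer: n / total for kmer, n in counts.items()}


def train_weights(pos, neg, k=3):
    """Weight per k-mer: mean frequency in positives minus mean in negatives.

    Assumption: this stands in for the SVM training step of the real method.
    """
    weights = {}
    for seqs, sign in ((pos, 1.0), (neg, -1.0)):
        for seq in seqs:
            for kmer, f in kmer_freqs(seq, k).items():
                weights[kmer] = weights.get(kmer, 0.0) + sign * f / len(seqs)
    return weights


def score(seq, weights, k=3):
    """Linear score of a sequence under the learned k-mer weights."""
    return sum(f * weights.get(kmer, 0.0) for kmer, f in kmer_freqs(seq, k).items())


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))


def quality_score(peaks, background, k=3):
    """Train on the first half of each set, report AUC on the held-out half."""
    half_p, half_n = len(peaks) // 2, len(background) // 2
    weights = train_weights(peaks[:half_p], background[:half_n], k)
    pos_scores = [score(s, weights, k) for s in peaks[half_p:]]
    neg_scores = [score(s, weights, k) for s in background[half_n:]]
    return auc(pos_scores, neg_scores)


def random_seq(rng, n):
    return "".join(rng.choice("ACGT") for _ in range(n))


def motif_seq(rng):
    """Toy 40-bp peak: four motif copies separated by random 4-bp spacers."""
    return "".join(MOTIF + random_seq(rng, 4) for _ in range(4))


rng = random.Random(0)
background = [random_seq(rng, 40) for _ in range(100)]
clean_peaks = [motif_seq(rng) for _ in range(100)]      # sequence signal present
noisy_peaks = [random_seq(rng, 40) for _ in range(100)]  # no sequence signal

clean_q = quality_score(clean_peaks, background)
noisy_q = quality_score(noisy_peaks, background)
print(f"clean sample: {clean_q:.2f}, noisy sample: {noisy_q:.2f}")
```

In this sketch a sample whose peaks share learnable sequence content scores well above a sample whose "peaks" cannot be told apart from background, which is the intuition behind a prediction-accuracy quality metric.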
id pubmed-9907136
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9907136 2023-02-08 Proc Natl Acad Sci U S A Biological Sciences National Academy of Sciences 2022-12-12 2022-12-20 /pmc/articles/PMC9907136/ /pubmed/36508674 http://dx.doi.org/10.1073/pnas.2212810119 Text en Copyright © 2022 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
topic Biological Sciences