Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence...
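Main Authors: | Han, Seong Kyu; Muto, Yoshiharu; Wilson, Parker C.; Humphreys, Benjamin D.; Sampson, Matthew G.; Chakravarti, Aravinda; Lee, Dongwon |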
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences, 2022 |
Subjects: | Biological Sciences |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907136/ https://www.ncbi.nlm.nih.gov/pubmed/36508674 http://dx.doi.org/10.1073/pnas.2212810119 |
_version_ | 1784884112167796736 |
---|---|
author | Han, Seong Kyu; Muto, Yoshiharu; Wilson, Parker C.; Humphreys, Benjamin D.; Sampson, Matthew G.; Chakravarti, Aravinda; Lee, Dongwon |
author_sort | Han, Seong Kyu |
collection | PubMed |
description | Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify “high-quality” (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data. |
format | Online Article Text |
id | pubmed-9907136 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-9907136 2023-02-08 Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model Han, Seong Kyu; Muto, Yoshiharu; Wilson, Parker C.; Humphreys, Benjamin D.; Sampson, Matthew G.; Chakravarti, Aravinda; Lee, Dongwon Proc Natl Acad Sci U S A Biological Sciences Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify “high-quality” (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data. National Academy of Sciences 2022-12-12 2022-12-20 /pmc/articles/PMC9907136/ /pubmed/36508674 http://dx.doi.org/10.1073/pnas.2212810119 Text en Copyright © 2022 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model |
topic | Biological Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907136/ https://www.ncbi.nlm.nih.gov/pubmed/36508674 http://dx.doi.org/10.1073/pnas.2212810119 |