Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation

MOTIVATION: Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. High degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-...

Bibliographic Details
Main authors: Albrecht, Steffen; Andreani, Tommaso; Andrade-Navarro, Miguel A.; Fontaine, Jean Fred
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9249201/
https://www.ncbi.nlm.nih.gov/pubmed/35776722
http://dx.doi.org/10.1371/journal.pone.0270043
author Albrecht, Steffen
Andreani, Tommaso
Andrade-Navarro, Miguel A.
Fontaine, Jean Fred
collection PubMed
description MOTIVATION: Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. High degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors. RESULTS: Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways in 2 real human and mouse datasets. The SIMPA’s interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets. AVAILABILITY AND IMPLEMENTATION: Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA.
format Online
Article
Text
id pubmed-9249201
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9249201 2022-07-02 PLoS One Research Article Public Library of Science 2022-07-01 /pmc/articles/PMC9249201/ /pubmed/35776722 http://dx.doi.org/10.1371/journal.pone.0270043 Text en © 2022 Albrecht et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation
topic Research Article