
Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation

Core promoters are stretches of DNA at the beginning of genes that contain information that facilitates the binding of transcription initiation complexes. Different functional subsets of genes have core promoters with distinct architectures and characteristic motifs. Some of these motifs inform the...

Bibliographic Details
Main authors: Nikumbh, Sarvesh; Lenhard, Boris
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects: Methods
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10695386/
https://www.ncbi.nlm.nih.gov/pubmed/37983292
http://dx.doi.org/10.1371/journal.pcbi.1011491
_version_ 1785153555430113280
author Nikumbh, Sarvesh
Lenhard, Boris
author_facet Nikumbh, Sarvesh
Lenhard, Boris
author_sort Nikumbh, Sarvesh
collection PubMed
description Core promoters are stretches of DNA at the beginning of genes that contain information that facilitates the binding of transcription initiation complexes. Different functional subsets of genes have core promoters with distinct architectures and characteristic motifs. Some of these motifs inform the selection of transcription start sites (TSS). By discovering motifs with fixed distances from known TSS positions, we could in principle classify promoters into different functional groups. Due to the variability and overlap of architectures, promoter classification is a difficult task that requires new approaches. In this study, we present a new method based on non-negative matrix factorisation (NMF) and the associated software called seqArchR that clusters promoter sequences based on their motifs at near-fixed distances from a reference point, such as TSS. When combined with experimental data from CAGE, seqArchR can efficiently identify TSS-directing motifs, including known ones like TATA, DPE, and nucleosome positioning signal, as well as novel lineage-specific motifs and the function of genes associated with them. By using seqArchR on developmental time courses, we reveal how relative use of promoter architectures changes over time with stage-specific expression. seqArchR is a powerful tool for initial genome-wide classification and functional characterisation of promoters. Its use cases are more general: it can also be used to discover any motifs at near-fixed distances from a reference point, even if they are present in only a small subset of sequences.
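The core idea in the description above, factorising a position-specific encoding of the sequences with NMF and grouping sequences by their dominant factor, can be illustrated with a small Python sketch. This is a minimal illustration under stated assumptions, not the seqArchR implementation: the chunking step named in the title is not shown, and the helper names (`one_hot`, `nmf`, `cluster_by_factor`), the multiplicative-update solver, the toy sequences, and the choice of two factors are all hypothetical.

```python
import random

ALPHABET = "ACGT"

def one_hot(seq):
    """Position-specific one-hot vector: 4 entries per position (unknown bases give all zeros)."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vec

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (features x samples) as W (features x k) times H (k x samples).

    Uses standard multiplicative updates for the squared-error objective
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

def reconstruction_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))

def cluster_by_factor(seqs, k, **kwargs):
    """Label each sequence with the index of its largest coefficient in H."""
    v = transpose([one_hot(s) for s in seqs])  # rows: position/base features, columns: sequences
    _, h = nmf(v, k, **kwargs)
    return [max(range(k), key=lambda f: h[f][j]) for j in range(len(seqs))]

# Toy input (hypothetical): three sequences with TATA at positions 2-5,
# three with GCGC at positions 5-8, all aligned to the same reference point.
toy_seqs = ["ACTATAGGCA", "GGTATACCTG", "CATATATCGA",
            "TTGACGCGCA", "ACCTAGCGCT", "CAGGTGCGCG"]
labels = cluster_by_factor(toy_seqs, k=2)
```

Each column of `W` can be read as a position-specific pattern over the 4 × length features, which is what ties a factor to a motif at a near-fixed distance from the reference point. Which label a sequence receives depends on the random initialisation, so the labels should be treated as arbitrary cluster identifiers.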
format Online
Article
Text
id pubmed-10695386
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10695386 2023-12-05 Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation Nikumbh, Sarvesh; Lenhard, Boris PLoS Comput Biol Methods
Public Library of Science 2023-11-20 /pmc/articles/PMC10695386/ /pubmed/37983292 http://dx.doi.org/10.1371/journal.pcbi.1011491 Text en © 2023 Nikumbh, Lenhard https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Methods
Nikumbh, Sarvesh
Lenhard, Boris
Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation
title Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation
title_sort identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10695386/
https://www.ncbi.nlm.nih.gov/pubmed/37983292
http://dx.doi.org/10.1371/journal.pcbi.1011491
work_keys_str_mv AT nikumbhsarvesh identifyingpromotersequencearchitecturesviaachunkingbasedalgorithmusingnonnegativematrixfactorisation
AT lenhardboris identifyingpromotersequencearchitecturesviaachunkingbasedalgorithmusingnonnegativematrixfactorisation