
MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We...


Bibliographic Details
Main authors: Bredesen, Bjørn André; Rehmsmeier, Marc
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8105988/
https://www.ncbi.nlm.nih.gov/pubmed/33962556
http://dx.doi.org/10.1186/s12859-021-04143-2
author Bredesen, Bjørn André
Rehmsmeier, Marc
collection PubMed
description BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which could be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs.
RESULTS: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and BEs beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results.
CONCLUSION: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problem for which motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, as well as a mode that requires the user to specify only positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04143-2.
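The abstract refers to IUPAC motifs and to motif occurrence frequency models. As a rough illustration of what those terms mean, the following Python sketch finds occurrences of an IUPAC motif on both strands and turns the counts into per-kilobase frequencies. It is not taken from MOCCA and does not reproduce its interface; the function names, the per-kilobase normalisation and the example motif are assumptions made for illustration only.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an upper-case IUPAC motif string."""
    return motif.translate(COMPLEMENT)[::-1]


def iupac_to_regex(motif):
    """Compile a motif to a regex; the lookahead reports overlapping hits."""
    body = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=" + body + ")")


def find_occurrences(seq, motif):
    """Sorted start positions of the motif on either strand (hypothetical helper)."""
    seq = seq.upper()
    motif = motif.upper()
    starts = {m.start() for m in iupac_to_regex(motif).finditer(seq)}
    rc = reverse_complement(motif)
    starts |= {m.start() for m in iupac_to_regex(rc).finditer(seq)}
    return sorted(starts)


def occurrence_frequencies(seq, motifs):
    """Occurrences per kilobase for each named motif (assumed normalisation)."""
    return {
        name: 1000.0 * len(find_occurrences(seq, motif)) / len(seq)
        for name, motif in motifs.items()
    }
```

For example, `find_occurrences("TTGAGAGTT", "GAGAG")` reports the forward-strand match, and the same call on `"AACTCTCAA"` reports the match to the reverse complement. A feature vector of such frequencies could then be passed to any standard classifier; how MOCCA itself builds and combines features is described in the article and repository, not here.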
format Online
Article
Text
id pubmed-8105988
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8105988 2021-05-10 MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics Bredesen, Bjørn André Rehmsmeier, Marc BMC Bioinformatics Software BioMed Central 2021-05-07 /pmc/articles/PMC8105988/ /pubmed/33962556 http://dx.doi.org/10.1186/s12859-021-04143-2 Text en
© The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8105988/
https://www.ncbi.nlm.nih.gov/pubmed/33962556
http://dx.doi.org/10.1186/s12859-021-04143-2