MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics
BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We...
Main authors: | Bredesen, Bjørn André; Rehmsmeier, Marc |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8105988/ https://www.ncbi.nlm.nih.gov/pubmed/33962556 http://dx.doi.org/10.1186/s12859-021-04143-2 |
_version_ | 1783689688874221568 |
---|---|
author | Bredesen, Bjørn André Rehmsmeier, Marc |
author_facet | Bredesen, Bjørn André Rehmsmeier, Marc |
author_sort | Bredesen, Bjørn André |
collection | PubMed |
description | BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. RESULTS: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics—Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest–derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods—including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests—, with RF-MOCCA yielding the best results. CONCLUSION: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. 
For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04143-2. |
format | Online Article Text |
id | pubmed-8105988 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-81059882021-05-10 MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics Bredesen, Bjørn André Rehmsmeier, Marc BMC Bioinformatics Software BACKGROUND: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. RESULTS: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics—Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest–derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods—including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests—, with RF-MOCCA yielding the best results. 
CONCLUSION: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04143-2. BioMed Central 2021-05-07 /pmc/articles/PMC8105988/ /pubmed/33962556 http://dx.doi.org/10.1186/s12859-021-04143-2 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Bredesen, Bjørn André Rehmsmeier, Marc MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title | MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title_full | MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title_fullStr | MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title_full_unstemmed | MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title_short | MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics |
title_sort | mocca: a flexible suite for modelling dna sequence motif occurrence combinatorics |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8105988/ https://www.ncbi.nlm.nih.gov/pubmed/33962556 http://dx.doi.org/10.1186/s12859-021-04143-2 |
work_keys_str_mv | AT bredesenbjørnandre moccaaflexiblesuiteformodellingdnasequencemotifoccurrencecombinatorics AT rehmsmeiermarc moccaaflexiblesuiteformodellingdnasequencemotifoccurrencecombinatorics |
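
The description above mentions two baseline feature types, the 4-spectrum and motif occurrence frequencies over IUPAC motifs. The following Python is a minimal sketch of how such features might be computed, not MOCCA's own implementation. The function names, the forward-strand-only scanning, and the per-kilobase normalisation are assumptions made for illustration; the record does not specify them.

```python
import re
from collections import Counter

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regex; the lookahead allows overlapping hits."""
    pattern = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=(" + pattern + "))")

def motif_occurrences(seq, motif):
    """Start positions of forward-strand matches (assumption: no reverse complement)."""
    return [m.start() for m in iupac_to_regex(motif).finditer(seq.upper())]

def motif_frequencies(seq, motifs):
    """Occurrences per 1000 bp for each named motif (hypothetical normalisation)."""
    return {
        name: 1000.0 * len(motif_occurrences(seq, motif)) / len(seq)
        for name, motif in motifs.items()
    }

def k_spectrum(seq, k=4):
    """Counts of every overlapping k-mer in the sequence (k=4 gives the 4-spectrum)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

Either dictionary can be flattened into a fixed-order vector and passed to a classifier such as an SVM or Random Forest. The hierarchical MOCCA models described in the record go further by modelling the sequence around each individual motif occurrence, which this sketch does not attempt.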