
Comparison of discriminative motif optimization using matrix and DNA shape-based models

BACKGROUND: Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site’s activity. The independence assumption is known to be an approximation, often a good one...


Bibliographic Details
Main authors: Ruan, Shuxiang; Stormo, Gary D.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5840810/
https://www.ncbi.nlm.nih.gov/pubmed/29510689
http://dx.doi.org/10.1186/s12859-018-2104-7
collection PubMed
description BACKGROUND: Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site’s activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use k-mers (DNA “words” of length k) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement gained from more complex models, one must compare them against optimized matrix models. RESULTS: We describe a program, “Discriminative Additive Model Optimization” (DAMO), that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare with a recent study in which structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM. CONCLUSION: To appropriately compare different models for TF binding sites, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including the incorporation of DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2104-7) contains supplementary material, which is available to authorized users.
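The abstract above describes scoring sequences with an additive PWM and judging the model by its AUROC on positive and negative examples. The following is a minimal sketch of that evaluation step only, not of DAMO's optimization procedure, which this record does not describe. The matrix values, the sequences, the function names, and the choice to scan only the forward strand and take the best-scoring window are all illustrative assumptions.

```python
# Hypothetical 4-position additive PWM; the weights are made up for illustration.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": 1.1, "G": -0.7, "T": -0.4},
    {"A": -0.6, "C": -0.9, "G": 1.3, "T": -0.8},
    {"A": -0.9, "C": -0.5, "G": -0.7, "T": 1.0},
]


def site_score(pwm, site):
    """Additive score: each position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, site))


def best_score(pwm, seq):
    """Score a sequence by its best window (forward strand only, an assumption)."""
    width = len(pwm)
    return max(
        site_score(pwm, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    )


def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is O(n*m) and is
    meant for clarity, not for large data sets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy "bound" and "unbound" sequences (hypothetical, not ChIP-seq data).
positives = ["TTACGTAA", "ACGTCCCC"]
negatives = ["TTTTTTTT", "GGGGCCCC"]

pos_scores = [best_score(PWM, s) for s in positives]
neg_scores = [best_score(PWM, s) for s in negatives]
print(auroc(pos_scores, neg_scores))  # prints 1.0
```

A discriminative optimizer in the spirit described above would adjust the matrix weights to raise this AUROC value. A di-nucleotide extension would add weights for adjacent base pairs to the per-position sum in `site_score`.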
format Online
Article
Text
id pubmed-5840810
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5840810 2018-03-14 Comparison of discriminative motif optimization using matrix and DNA shape-based models Ruan, Shuxiang Stormo, Gary D. BMC Bioinformatics Methodology Article BioMed Central 2018-03-06 /pmc/articles/PMC5840810/ /pubmed/29510689 http://dx.doi.org/10.1186/s12859-018-2104-7 Text en © The Author(s). 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Methodology Article