
Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif...


Bibliographic Details
Main authors: Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei
Format: Online Article Text
Language: English
Published: Genetics Society of America 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536857/
https://www.ncbi.nlm.nih.gov/pubmed/32816922
http://dx.doi.org/10.1534/genetics.120.303597
_version_ 1783590626986557440
author Wang, Mengchi
Wang, David
Zhang, Kai
Ngo, Vu
Fan, Shicai
Wang, Wei
author_sort Wang, Mengchi
collection PubMed
description Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification.
format Online
Article
Text
id pubmed-7536857
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-75368572020-10-14 Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss Wang, Mengchi Wang, David Zhang, Kai Ngo, Vu Fan, Shicai Wang, Wei Genetics Investigations Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification. Genetics Society of America 2020-10 2020-08-19 /pmc/articles/PMC7536857/ /pubmed/32816922 http://dx.doi.org/10.1534/genetics.120.303597 Text en Copyright © 2020 Wang et al. Available freely online through the author-supported open access option. 
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss
topic Investigations
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536857/
https://www.ncbi.nlm.nih.gov/pubmed/32816922
http://dx.doi.org/10.1534/genetics.120.303597
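
The abstract describes converting a PWM into a wildcard-style consensus sequence by minimizing information loss measured with Jensen-Shannon divergence, and then using that consensus to search for motif matches. The sketch below illustrates the idea only; it is not the authors' implementation. It assumes that each PWM column is handled independently and is replaced by the set of letters whose uniform distribution has the smallest Jensen-Shannon divergence to that column. The function names (`js_divergence`, `best_subset`, `pwm_to_consensus`, `find_matches`) and the toy PWM are hypothetical, and none of the tool's options mentioned in the abstract are modeled.

```python
import itertools
import math
import re

ALPHABET = "ACGT"  # assumed column order of the PWM rows


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is never 0 where a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_subset(column):
    """Indices of the letter set whose uniform distribution is closest to `column`."""
    best, best_score = None, float("inf")
    n = len(ALPHABET)
    for k in range(1, n + 1):
        for idx in itertools.combinations(range(n), k):
            q = [1 / k if i in idx else 0.0 for i in range(n)]
            score = js_divergence(column, q)
            if score < best_score - 1e-12:  # keep the smaller set on ties
                best, best_score = idx, score
    # List the most probable letter first, as in [GC][AT]GATAAG[GAC].
    return sorted(best, key=lambda i: -column[i])


def pwm_to_consensus(pwm):
    """Convert a list of probability columns into a wildcard-style consensus string."""
    parts = []
    for column in pwm:
        letters = "".join(ALPHABET[i] for i in best_subset(column))
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return "".join(parts)


def find_matches(consensus, sequence):
    """Return (start, matched_text) for every match, including overlapping ones.

    The bracket notation is already valid regular-expression syntax, so the
    consensus is used as the pattern directly. Only the given strand is scanned.
    """
    pattern = re.compile(f"(?=({consensus}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    toy_pwm = [
        [0.97, 0.01, 0.01, 0.01],  # almost always A
        [0.05, 0.45, 0.45, 0.05],  # C or G
        [0.25, 0.25, 0.25, 0.25],  # uninformative
    ]
    print(pwm_to_consensus(toy_pwm))
    print(find_matches("[GC][AT]GATAAG[GAC]", "TTGAGATAAGGCC"))
```

Because the consensus string doubles as a regular expression, matching needs no PWM scoring threshold, which is the convenience the abstract points to. Scanning the reverse-complement strand would require a second pass and is left out of this sketch.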