Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss
Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif...
Main Authors: | Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Genetics Society of America 2020 |
Subjects: | Investigations |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536857/ https://www.ncbi.nlm.nih.gov/pubmed/32816922 http://dx.doi.org/10.1534/genetics.120.303597 |
_version_ | 1783590626986557440 |
---|---|
author | Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei |
collection | PubMed |
description | Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification. |
format | Online Article Text |
id | pubmed-7536857 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Genetics Society of America |
record_format | MEDLINE/PubMed |
spelling | pubmed-7536857 2020-10-14. Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss. Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei. Genetics. Investigations. Genetics Society of America 2020-10 2020-08-19 /pmc/articles/PMC7536857/ /pubmed/32816922 http://dx.doi.org/10.1534/genetics.120.303597 Text en Copyright © 2020 Wang et al. Available freely online through the author-supported open access option. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss |
topic | Investigations |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536857/ https://www.ncbi.nlm.nih.gov/pubmed/32816922 http://dx.doi.org/10.1534/genetics.120.303597 |
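The description above mentions converting PWMs to wildcard-style consensus sequences such as [GC][AT]GATAAG[GAC] using Jensen-Shannon divergence, and then using them to search for motif matches. The following is a minimal Python sketch of that general idea, not the authors' implementation. It rests on assumptions that are not stated in the record: each candidate character set is modelled as a uniform distribution over its members, the best set per position is found by brute force, and only the forward strand is scanned. The function names jsd, best_set, consensus, and find_matches are hypothetical.

```python
import math
import re
from itertools import combinations

ALPHABET = "ACGT"  # column order assumed for every PWM position


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_set(column):
    """Return the nucleotide subset whose uniform distribution is closest to the column.

    Assumption: a character set such as [AT] stands for equal weight on A and T.
    """
    best = None
    n = len(ALPHABET)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            q = [1 / k if i in subset else 0.0 for i in range(n)]
            d = jsd(column, q)
            if best is None or d < best[0]:
                best = (d, subset)
    return "".join(ALPHABET[i] for i in best[1])


def consensus(pwm):
    """Convert a PWM (list of per-position probability columns) to a consensus string."""
    parts = []
    for column in pwm:
        chars = best_set(column)
        parts.append(chars if len(chars) == 1 else "[" + chars + "]")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Scan the forward strand for non-overlapping matches; return (start, text) pairs."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


if __name__ == "__main__":
    toy_pwm = [
        [0.0, 0.0, 1.0, 0.0],  # G
        [1.0, 0.0, 0.0, 0.0],  # A
        [0.5, 0.0, 0.0, 0.5],  # A or T
    ]
    print(consensus(toy_pwm))
    print(find_matches("[GC][AT]GATAAG[GAC]", "TTCAGATAAGGTT"))
```

Because the bracket notation is already valid regular-expression syntax, a consensus string can be passed directly to a regex engine. A fuller scan would also need to search the reverse complement and handle overlapping matches, which this sketch leaves out.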