Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif...

Bibliographic Details
Main Authors:	Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei
Format:	Online Article Text
Language:	English
Published:	Genetics Society of America 2020
Subjects:	Investigations
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536857/ https://www.ncbi.nlm.nih.gov/pubmed/32816922 http://dx.doi.org/10.1534/genetics.120.303597

Record ID:	pubmed-7536857
Collection:	PubMed
Institution:	National Center for Biotechnology Information
Record Format:	MEDLINE/PubMed
Journal:	Genetics
Publication Dates:	2020-10; 2020-08-19
License:	Copyright © 2020 Wang et al. Available freely online through the author-supported open access option. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Description:	Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification.
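
The conversion described in the abstract above can be illustrated with a minimal sketch. This is not the authors' Motto implementation: the function names, the toy matrix, and the rule used here (for each PWM column, keep the k most probable letters whose uniform distribution has the smallest Jensen-Shannon divergence from the column) are assumptions chosen only to show the general idea. The resulting wildcard-style string is then used directly as a regular expression to find matches.

```python
import math
import re

ALPHABET = "ACGT"  # assumed column order of the toy PWM below


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_subset(column):
    """Return the top-k letters whose uniform distribution is closest to the column.

    Illustrative rule only (an assumption, not the published algorithm).
    """
    ranked = sorted(range(len(column)), key=lambda i: -column[i])
    best_score, best_letters = None, None
    for k in range(1, len(column) + 1):
        chosen = ranked[:k]
        uniform = [1 / k if i in chosen else 0.0 for i in range(len(column))]
        score = js_divergence(column, uniform)
        if best_score is None or score < best_score:
            best_score, best_letters = score, chosen
    return "".join(ALPHABET[i] for i in best_letters)


def pwm_to_consensus(pwm):
    """Convert a list of probability columns into a wildcard-style consensus string."""
    parts = []
    for column in pwm:
        letters = best_subset(column)
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "".join(parts)


# Hypothetical three-position PWM; each row is P(A), P(C), P(G), P(T).
example_pwm = [
    [0.05, 0.45, 0.45, 0.05],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
consensus = pwm_to_consensus(example_pwm)
print(consensus)  # [CG]A[ACGT]

# The consensus string is valid regular-expression syntax, so it can be used to scan a sequence.
hits = [match.start() for match in re.finditer(consensus, "TTCATGAC")]
print(hits)  # [2, 5]
```

Because the bracket notation coincides with regular-expression character classes, a consensus of this kind can be searched with any standard regex tool, which is the convenience the abstract points to when contrasting it with PWM scanning.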
