Fast and exact quantification of motif occurrences in biological sequences

BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high comput...

Bibliographic Details
Main authors:	Prosperi, Mattia; Marini, Simone; Boucher, Christina
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8449872/ https://www.ncbi.nlm.nih.gov/pubmed/34537012 http://dx.doi.org/10.1186/s12859-021-04355-6

_version_	1784569504430292992
author	Prosperi, Mattia; Marini, Simone; Boucher, Christina
author_facet	Prosperi, Mattia; Marini, Simone; Boucher, Christina
author_sort	Prosperi, Mattia
collection	PubMed
description	BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximate formulae, e.g. based on compound Poisson, are faster, but reliable p-value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it a feasible substitute for currently employed heuristics. RESULTS: We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula, and compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software can process one million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are up to three orders of magnitude smaller (50–1000× faster) than MoSDi, even when MoSDi uses its fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
CONCLUSIONS: The motif_prob software is a multi-platform, efficient, open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools to quantify enrichment and deviation from expected frequency ranges with exact p-values, without loss of data processing efficiency.
format	Online Article Text
id	pubmed-8449872
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8449872 2021-09-20 Fast and exact quantification of motif occurrences in biological sequences Prosperi, Mattia Marini, Simone Boucher, Christina BMC Bioinformatics Software BioMed Central 2021-09-18 /pmc/articles/PMC8449872/ /pubmed/34537012 http://dx.doi.org/10.1186/s12859-021-04355-6 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Software Prosperi, Mattia Marini, Simone Boucher, Christina Fast and exact quantification of motif occurrences in biological sequences
title	Fast and exact quantification of motif occurrences in biological sequences
title_full	Fast and exact quantification of motif occurrences in biological sequences
title_fullStr	Fast and exact quantification of motif occurrences in biological sequences
title_full_unstemmed	Fast and exact quantification of motif occurrences in biological sequences
title_short	Fast and exact quantification of motif occurrences in biological sequences
title_sort	fast and exact quantification of motif occurrences in biological sequences
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8449872/ https://www.ncbi.nlm.nih.gov/pubmed/34537012 http://dx.doi.org/10.1186/s12859-021-04355-6
work_keys_str_mv	AT prosperimattia fastandexactquantificationofmotifoccurrencesinbiologicalsequences AT marinisimone fastandexactquantificationofmotifoccurrencesinbiologicalsequences AT boucherchristina fastandexactquantificationofmotifoccurrencesinbiologicalsequences
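
The abstract's remark that p-value quantification matters when bacteria differ in GC content can be illustrated with a small sketch. This is not the motif_prob exact formula, which the record does not spell out. It assumes an independent-base (order-0) model on a single strand and a plain Poisson tail that ignores motif self-overlap, and all function names are hypothetical.

```python
import math


def motif_probability(motif: str, gc_content: float) -> float:
    """Probability of `motif` at one fixed position, assuming independent
    bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2."""
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in motif.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported base: {base!r}")
    return prob


def expected_count(motif: str, genome_length: int, gc_content: float) -> float:
    """Expected number of occurrences over all start positions (one strand)."""
    positions = max(genome_length - len(motif) + 1, 0)
    return positions * motif_probability(motif, gc_content)


def poisson_tail(observed: int, mean: float) -> float:
    """P(X >= observed) for X ~ Poisson(mean).

    A rough stand-in for an enrichment p-value; it ignores overlapping
    occurrences and loses precision for very small tail probabilities.
    """
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    motif = "GCGCGCGCGCGCG"  # hypothetical 13-base GC-rich motif
    for gc in (0.3, 0.5, 0.7):
        mean = expected_count(motif, 5_000_000, gc)
        print(gc, mean, poisson_tail(3, mean))
```

Under these assumptions, the same observed count of a GC-rich motif yields very different tail probabilities in a low-GC and a high-GC genome, which is why a raw count alone is a poor measure of enrichment.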
