
Fast and exact quantification of motif occurrences in biological sequences

BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high comput...


Bibliographic Details
Main authors: Prosperi, Mattia, Marini, Simone, Boucher, Christina
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8449872/
https://www.ncbi.nlm.nih.gov/pubmed/34537012
http://dx.doi.org/10.1186/s12859-021-04355-6
_version_ 1784569504430292992
author Prosperi, Mattia
Marini, Simone
Boucher, Christina
collection PubMed
description BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. those based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and positioned to replace currently employed heuristics. RESULTS: We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run-time benchmarks, along with a real-world use case on bacterial motif characterization. Our software can process a million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50–1000× faster) than those of MoSDi, even when MoSDi uses its fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then show how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
CONCLUSIONS: The motif_prob software is a multi-platform and efficient open-source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p-values, without loss in data processing efficiency.
format Online
Article
Text
id pubmed-8449872
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8449872 2021-09-20 Fast and exact quantification of motif occurrences in biological sequences Prosperi, Mattia Marini, Simone Boucher, Christina BMC Bioinformatics Software BioMed Central 2021-09-18 /pmc/articles/PMC8449872/ /pubmed/34537012 http://dx.doi.org/10.1186/s12859-021-04355-6 Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Fast and exact quantification of motif occurrences in biological sequences
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8449872/
https://www.ncbi.nlm.nih.gov/pubmed/34537012
http://dx.doi.org/10.1186/s12859-021-04355-6