Cargando…

Fast and compact matching statistics analytics

MOTIVATION: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for d...

Descripción completa

Detalles Bibliográficos
Autores principales:	Cunial, Fabio, Denas, Olgert, Belazzougui, Djamal
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2022
Materias:	Original Papers
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9665870/ https://www.ncbi.nlm.nih.gov/pubmed/35134833 http://dx.doi.org/10.1093/bioinformatics/btac064

_version_	1784831380677459968
author	Cunial, Fabio Denas, Olgert Belazzougui, Djamal
author_facet	Cunial, Fabio Denas, Olgert Belazzougui, Djamal
author_sort	Cunial, Fabio
collection	PubMed
description	MOTIVATION: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS: We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION: Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9665870
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-96658702022-11-16 Fast and compact matching statistics analytics Cunial, Fabio Denas, Olgert Belazzougui, Djamal Bioinformatics Original Papers MOTIVATION: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS: We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION: Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-02-04 /pmc/articles/PMC9665870/ /pubmed/35134833 http://dx.doi.org/10.1093/bioinformatics/btac064 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Cunial, Fabio Denas, Olgert Belazzougui, Djamal Fast and compact matching statistics analytics
title	Fast and compact matching statistics analytics
title_full	Fast and compact matching statistics analytics
title_fullStr	Fast and compact matching statistics analytics
title_full_unstemmed	Fast and compact matching statistics analytics
title_short	Fast and compact matching statistics analytics
title_sort	fast and compact matching statistics analytics
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9665870/ https://www.ncbi.nlm.nih.gov/pubmed/35134833 http://dx.doi.org/10.1093/bioinformatics/btac064
work_keys_str_mv	AT cunialfabio fastandcompactmatchingstatisticsanalytics AT denasolgert fastandcompactmatchingstatisticsanalytics AT belazzouguidjamal fastandcompactmatchingstatisticsanalytics

Fast and compact matching statistics analytics

Ejemplares similares