
How to optimally sample a sequence for rapid analysis

MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlappin...

Bibliographic Details
Main authors:	Frith, Martin C; Shaw, Jim; Spouge, John L
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/ https://www.ncbi.nlm.nih.gov/pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057

_version_	1784884132318281728
author	Frith, Martin C; Shaw, Jim; Spouge, John L
author_facet	Frith, Martin C; Shaw, Jim; Spouge, John L
author_sort	Frith, Martin C
collection	PubMed
description	MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9907223
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9907223 2023-02-09 How to optimally sample a sequence for rapid analysis Frith, Martin C; Shaw, Jim; Spouge, John L Bioinformatics Original Paper MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2023-01-25 /pmc/articles/PMC9907223/ /pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	How to optimally sample a sequence for rapid analysis
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/ https://www.ncbi.nlm.nih.gov/pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057
