How to optimally sample a sequence for rapid analysis

Bibliographic Details
Main Authors: Frith, Martin C, Shaw, Jim, Spouge, John L
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/
https://www.ncbi.nlm.nih.gov/pubmed/36702468
http://dx.doi.org/10.1093/bioinformatics/btad057
author Frith, Martin C
Shaw, Jim
Spouge, John L
collection PubMed
description MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9907223
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9907223 2023-02-09 How to optimally sample a sequence for rapid analysis Frith, Martin C Shaw, Jim Spouge, John L Bioinformatics Original Paper Oxford University Press 2023-01-25 /pmc/articles/PMC9907223/ /pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title How to optimally sample a sequence for rapid analysis
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/
https://www.ncbi.nlm.nih.gov/pubmed/36702468
http://dx.doi.org/10.1093/bioinformatics/btad057