How to optimally sample a sequence for rapid analysis
MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlappin...
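As background for the sampling methods named in the abstract, the following is a minimal sketch of minimizer sampling, one of the earlier heuristic schemes it mentions. It is not the approach proposed in this paper. The function name `minimizer_positions`, the parameters `k` and `w`, and the use of plain lexicographic order are assumptions made for illustration; practical tools typically rank k-mers by a hash instead.

```python
def minimizer_positions(seq: str, k: int, w: int) -> list[int]:
    """Return sorted start positions of k-mers sampled as window minimizers.

    Hypothetical helper for illustration: in every window of w consecutive
    k-mers, the leftmost lexicographically smallest k-mer is sampled.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to hold a single full window
    chosen = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Break ties between equal k-mers by taking the leftmost position.
        best = min(window, key=lambda i: (seq[i:i + k], i))
        chosen.add(best)
    return sorted(chosen)


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=3, w=2))
```

By construction, every window of `w` consecutive k-mers contains at least one sampled position, which is the "at least once" guarantee the abstract associates with universal hitting sets.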
Main authors: | Frith, Martin C; Shaw, Jim; Spouge, John L |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/ https://www.ncbi.nlm.nih.gov/pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057 |
_version_ | 1784884132318281728 |
---|---|
author | Frith, Martin C Shaw, Jim Spouge, John L |
author_sort | Frith, Martin C |
collection | PubMed |
description | MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9907223 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9907223 2023-02-09 How to optimally sample a sequence for rapid analysis Frith, Martin C Shaw, Jim Spouge, John L Bioinformatics Original Paper Oxford University Press 2023-01-25 /pmc/articles/PMC9907223/ /pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | How to optimally sample a sequence for rapid analysis |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9907223/ https://www.ncbi.nlm.nih.gov/pubmed/36702468 http://dx.doi.org/10.1093/bioinformatics/btad057 |