microTaboo: a general and practical solution to the k-disjoint problem
BACKGROUND: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can efficiently be achieved by k-mer (k consecutive nucleotides) counting. However, there are several areas that would benefit from a more stringent d...
Main authors: | Al-Jaff, Mohammed; Sandström, Eric; Grabherr, Manfred |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5414201/ https://www.ncbi.nlm.nih.gov/pubmed/28464826 http://dx.doi.org/10.1186/s12859-017-1644-6 |
_version_ | 1783233320472018944 |
---|---|
author | Al-Jaff, Mohammed Sandström, Eric Grabherr, Manfred |
author_facet | Al-Jaff, Mohammed Sandström, Eric Grabherr, Manfred |
author_sort | Al-Jaff, Mohammed |
collection | PubMed |
description | BACKGROUND: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can efficiently be achieved by k-mer (k consecutive nucleotides) counting. However, there are several areas that would benefit from a more stringent definition of “unique”, requiring that these sub-sequences of length W differ by more than k mismatches (i.e. a Hamming distance greater than k) from any other sub-sequence, which we term the k-disjoint problem. Examples include finding sequences unique to a pathogen for probe-based infection diagnostics; reducing off-target hits for re-sequencing or genome editing; detecting sequence (e.g. phage or viral) insertions; and multiple substitution mutations. Since both sensitivity and specificity are critical, an exhaustive, yet efficient solution is desirable. RESULTS: We present microTaboo, a method that allows for efficient and extensive sequence mining of unique (k-disjoint) sequences of up to 100 nucleotides in length. On a number of simulated and real data sets ranging from microbe- to mammalian-size genomes, we show that microTaboo is able to efficiently find all sub-sequences of a specified length W that do not occur within a threshold of k mismatches in any other sub-sequence. We exemplify that microTaboo has many practical applications, including point substitution detection, sequence insertion detection, padlock probe target search, and candidate CRISPR target mining. CONCLUSIONS: microTaboo implements a solution to the k-disjoint problem in an alignment- and assembly free manner. microTaboo is available for Windows, Mac OS X, and Linux, running Java 7 and higher, under the GNU GPLv3 license, at: https://MohammedAlJaff.github.io/microTaboo ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1644-6) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5414201 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5414201 2017-05-03 microTaboo: a general and practical solution to the k-disjoint problem Al-Jaff, Mohammed Sandström, Eric Grabherr, Manfred BMC Bioinformatics Software BACKGROUND: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can efficiently be achieved by k-mer (k consecutive nucleotides) counting. However, there are several areas that would benefit from a more stringent definition of “unique”, requiring that these sub-sequences of length W differ by more than k mismatches (i.e. a Hamming distance greater than k) from any other sub-sequence, which we term the k-disjoint problem. Examples include finding sequences unique to a pathogen for probe-based infection diagnostics; reducing off-target hits for re-sequencing or genome editing; detecting sequence (e.g. phage or viral) insertions; and multiple substitution mutations. Since both sensitivity and specificity are critical, an exhaustive, yet efficient solution is desirable. RESULTS: We present microTaboo, a method that allows for efficient and extensive sequence mining of unique (k-disjoint) sequences of up to 100 nucleotides in length. On a number of simulated and real data sets ranging from microbe- to mammalian-size genomes, we show that microTaboo is able to efficiently find all sub-sequences of a specified length W that do not occur within a threshold of k mismatches in any other sub-sequence. We exemplify that microTaboo has many practical applications, including point substitution detection, sequence insertion detection, padlock probe target search, and candidate CRISPR target mining. CONCLUSIONS: microTaboo implements a solution to the k-disjoint problem in an alignment- and assembly free manner. microTaboo is available for Windows, Mac OS X, and Linux, running Java 7 and higher, under the GNU GPLv3 license, at: https://MohammedAlJaff.github.io/microTaboo ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1644-6) contains supplementary material, which is available to authorized users. BioMed Central 2017-05-02 /pmc/articles/PMC5414201/ /pubmed/28464826 http://dx.doi.org/10.1186/s12859-017-1644-6 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Al-Jaff, Mohammed Sandström, Eric Grabherr, Manfred microTaboo: a general and practical solution to the k-disjoint problem |
title | microTaboo: a general and practical solution to the k-disjoint problem |
title_full | microTaboo: a general and practical solution to the k-disjoint problem |
title_fullStr | microTaboo: a general and practical solution to the k-disjoint problem |
title_full_unstemmed | microTaboo: a general and practical solution to the k-disjoint problem |
title_short | microTaboo: a general and practical solution to the k-disjoint problem |
title_sort | microtaboo: a general and practical solution to the k-disjoint problem |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5414201/ https://www.ncbi.nlm.nih.gov/pubmed/28464826 http://dx.doi.org/10.1186/s12859-017-1644-6 |
work_keys_str_mv | AT aljaffmohammed microtabooageneralandpracticalsolutiontothekdisjointproblem AT sandstromeric microtabooageneralandpracticalsolutiontothekdisjointproblem AT grabherrmanfred microtabooageneralandpracticalsolutiontothekdisjointproblem |
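
The k-disjoint definition in the description field above (a sub-sequence of length W counts as unique only if its Hamming distance to every other sub-sequence is greater than k) can be made concrete with a brute-force sketch. This is not microTaboo's algorithm, which this record does not describe. The function names `hamming` and `k_disjoint_windows` are hypothetical, the comparison is quadratic in the number of windows, every other window in the input (including overlapping ones in the same sequence) is treated as "other", and reverse complements are ignored. It is intended only for toy input.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def k_disjoint_windows(seqs, w, k):
    """Return (sequence index, offset, window) for each length-w window whose
    Hamming distance to every other window in the input is greater than k.

    Naive illustration of the definition only (an assumption-laden sketch,
    not the published method): O(n^2) window comparisons.
    """
    # Enumerate every length-w window in every input sequence.
    windows = [
        (i, pos, s[pos:pos + w])
        for i, s in enumerate(seqs)
        for pos in range(len(s) - w + 1)
    ]
    result = []
    for a_idx, (i, pos, win) in enumerate(windows):
        # A window qualifies only if all other windows are more than k mismatches away.
        if all(
            hamming(win, other) > k
            for b_idx, (_, _, other) in enumerate(windows)
            if b_idx != a_idx
        ):
            result.append((i, pos, win))
    return result


if __name__ == "__main__":
    # "AAAA" and "AAAT" are one mismatch apart, so with k = 1 only "CCCC" survives.
    print(k_disjoint_windows(["AAAA", "AAAT", "CCCC"], w=4, k=1))
    # prints [(2, 0, 'CCCC')]
```

With k = 0 the same input returns all three windows, because each differs from the others by at least one position; this corresponds to the exact-uniqueness case that the description attributes to plain k-mer counting.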