Mining Unique-m Substrings from Genomes

Unique substrings in genomes may indicate a high level of specificity, which is crucial and fundamental to many genetics studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being a unique sequence in the genome alone...

Bibliographic Details
Main Authors: Ye, Kai; Jia, Zhenyu; Wang, Yipeng; Flicek, Paul; Apweiler, Rolf
Format: Online Article Text
Language: English
Published: 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5894807/
https://www.ncbi.nlm.nih.gov/pubmed/29657484
http://dx.doi.org/10.4172/jpb.1000127
author Ye, Kai
Jia, Zhenyu
Wang, Yipeng
Flicek, Paul
Apweiler, Rolf
collection PubMed
description Unique substrings in genomes may indicate a high level of specificity, which is crucial and fundamental to many genetics studies, such as PCR, microarray hybridization, Southern and Northern blotting, RNA interference (RNAi), and genome (re)sequencing. However, being a unique sequence in the genome is not by itself adequate to guarantee high specificity. For example, nucleotide mismatches within a certain tolerance may impair specificity even if a substring of interest occurs only once in the genome. In this study we propose the concept of unique-m substrings of genomes for controlling specificity in genome-wide assays. A substring is unique-m if it has only a single perfect match on one strand of the entire genome and all other approximate matches have more than m mismatches. We developed a pattern growth approach to systematically mine such unique-m substrings from a given genome. Our algorithm does not need the pre-processing step to extract sequential information that most rival methods require. The search for unique-m substrings from genomes is performed as a single data mining task, so that the similarities among queries are exploited to achieve tremendous speedup. The runtime of our algorithm is linear in the sizes of the input genomes and the length of the unique-m substrings. In addition, the unique-m mining algorithm has been parallelized to facilitate genome-wide computation on a cluster or on a single machine with multiple CPUs and shared memory.
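The unique-m definition in the description can be illustrated with a small brute-force check. This is only a sketch under stated assumptions, not the pattern growth algorithm the abstract refers to: the function names (`reverse_complement`, `hamming`, `is_unique_m`) are hypothetical, mismatches are counted as Hamming distance over equal-length windows, both strands are scanned by comparing against the reverse complement, and a substring that equals its own reverse complement is treated as having two perfect matches. The scan costs roughly genome length times substring length per query, so it is suitable only for toy inputs.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def hamming(a: str, b: str, limit: int) -> int:
    """Count mismatches between equal-length strings, stopping once above limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def is_unique_m(genome: str, start: int, length: int, m: int) -> bool:
    """Naive check (assumed reading of the definition): the substring at
    genome[start:start+length] is unique-m if, across both strands, the only
    window within m mismatches of it is its own occurrence."""
    query = genome[start:start + length]
    close_hits = 0
    for strand in (genome, reverse_complement(genome)):
        for i in range(len(strand) - length + 1):
            if hamming(query, strand[i:i + length], m) <= m:
                close_hits += 1
                if close_hits > 1:
                    # A second exact or approximate match breaks uniqueness.
                    return False
    return close_hits == 1


genome = "AAAAACCCCC"
print(is_unique_m(genome, 3, 4, 0))  # "AACC" occurs exactly once
print(is_unique_m(genome, 3, 4, 1))  # "AAAC" is one mismatch away
```

In this toy genome the substring `AACC` is unique when no mismatches are tolerated, but it stops being unique-1 because the neighbouring window `AAAC` differs from it at a single position.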
format Online
Article
Text
id pubmed-5894807
institution National Center for Biotechnology Information
language English
publishDate 2010
record_format MEDLINE/PubMed
spelling pubmed-5894807 2018-04-11 Mining Unique-m Substrings from Genomes Ye, Kai Jia, Zhenyu Wang, Yipeng Flicek, Paul Apweiler, Rolf J Proteomics Bioinform Article
2010-03-16 /pmc/articles/PMC5894807/ /pubmed/29657484 http://dx.doi.org/10.4172/jpb.1000127 Text en http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited (http://creativecommons.org/licenses/by/4.0/).
title Mining Unique-m Substrings from Genomes
topic Article