Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals

BACKGROUND: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p bin...

Bibliographic Details
Main authors: Reneker, Jeff; Shyu, Chi-Ren
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1131890/
https://www.ncbi.nlm.nih.gov/pubmed/15869708
http://dx.doi.org/10.1186/1471-2105-6-111
author Reneker, Jeff
Shyu, Chi-Ren
collection PubMed
description BACKGROUND: Searching for small tandem or dispersed repetitive DNA sequences streamlines many biomedical research processes. For instance, whole-genome array analysis in yeast has revealed 22 PHO-regulated genes, and the promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1), a hereditary disease caused by an expanded CAG repeat in the coding sequence of the associated gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation, which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 identified 46 putative phase-variable genes across the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches, most of which are irrelevant to the researcher.

RESULTS: We present our hash function and our search algorithm for locating small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for an eight-base sequence within 3.224 Gbases spread over 49 chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. A cross-species search is also presented to illustrate how potential hidden correlations in genomic data can be quickly discerned: the findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes, without the main memory constraints present in other systems.

CONCLUSION: We present a time-efficient algorithm that locates small segments of DNA and concurrently searches the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
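The abstract does not specify the authors' hash function, so the following is only a minimal sketch of one common way to hash short DNA strings: packing each base into two bits and indexing every k-mer by the resulting integer. The names `kmer_hash`, `build_index`, `search`, and `filter_by_annotation`, the toy sequence, and the annotation records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Two bits per base. This encoding is an illustrative assumption,
# not the hash function described by the authors.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Pack a short DNA string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(sequence, k):
    """Map each k-mer hash to the 0-based start positions where it occurs."""
    index = defaultdict(list)
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases in the window
    value = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for pos, base in enumerate(sequence):
        code = BASE_CODE.get(base)
        if code is None:  # ambiguous base such as N: restart the window
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            index[value].append(pos - k + 1)
    return index


def search(index, query):
    """Return all start positions of query (its length must equal the index k)."""
    return index.get(kmer_hash(query), [])


def filter_by_annotation(hits, annotations, term):
    """Keep hits inside a region whose description mentions term.

    annotations is a list of (start, end, description) tuples with
    0-based, end-exclusive coordinates (a hypothetical layout).
    """
    term = term.lower()
    return [
        pos
        for pos in hits
        if any(s <= pos < e and term in d.lower() for s, e, d in annotations)
    ]


if __name__ == "__main__":
    toy_sequence = "TTCACGTGAACACGTTCACGTG"  # made-up sequence
    toy_annotations = [  # made-up annotation records
        (0, 9, "promoter of a phosphate metabolism gene"),
        (9, 22, "intergenic region"),
    ]
    index = build_index(toy_sequence, 6)
    hits = search(index, "CACGTG")
    print(hits)
    print(filter_by_annotation(hits, toy_annotations, "phosphate"))
```

Because the index is keyed by an integer, a lookup for a fixed-length query is a single dictionary access; the annotation filter then mirrors the idea of narrowing a large hit list with descriptive terms, although the system described in the abstract uses information retrieval algorithms and the Gene Ontology database rather than this simple substring match.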
format Text
id pubmed-1131890
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1131890 2005-05-20 Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals Reneker, Jeff Shyu, Chi-Ren BMC Bioinformatics Research Article BioMed Central 2005-05-03 /pmc/articles/PMC1131890/ /pubmed/15869708 http://dx.doi.org/10.1186/1471-2105-6-111 Text en Copyright © 2005 Reneker and Shyu; licensee BioMed Central Ltd.
spellingShingle Research Article
Reneker, Jeff
Shyu, Chi-Ren
Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals
title Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals
topic Research Article