
Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals

BACKGROUND: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p bin...


Bibliographic Details
Main authors:	Reneker, Jeff; Shyu, Chi-Ren
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1131890/ https://www.ncbi.nlm.nih.gov/pubmed/15869708 http://dx.doi.org/10.1186/1471-2105-6-111

author	Reneker, Jeff Shyu, Chi-Ren
collection	PubMed
description	BACKGROUND: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches – most of which are irrelevant to the researcher.
RESULTS: We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes – without the main memory constraints present in other systems.
CONCLUSION: We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
format	Text
id	pubmed-1131890
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1131890 2005-05-20 Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals Reneker, Jeff Shyu, Chi-Ren BMC Bioinformatics Research Article BioMed Central 2005-05-03 /pmc/articles/PMC1131890/ /pubmed/15869708 http://dx.doi.org/10.1186/1471-2105-6-111 Text en Copyright © 2005 Reneker and Shyu; licensee BioMed Central Ltd.
title	Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1131890/ https://www.ncbi.nlm.nih.gov/pubmed/15869708 http://dx.doi.org/10.1186/1471-2105-6-111
