
Genome comparison without alignment using shortest unique substrings

BACKGROUND: Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequenc...


Bibliographic Details
Main Authors: Haubold, Bernhard, Pierstorff, Nora, Möller, Friedrich, Wiehe, Thomas
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166540/
https://www.ncbi.nlm.nih.gov/pubmed/15910684
http://dx.doi.org/10.1186/1471-2105-6-123
_version_ 1782124419560243200
author Haubold, Bernhard
Pierstorff, Nora
Möller, Friedrich
Wiehe, Thomas
author_sort Haubold, Bernhard
collection PubMed
description BACKGROUND: Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees. RESULTS: We find that the shortest unique substrings in Caenorhabditis elegans, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of Staphylococcus aureus strain MSSA476 compared to four other staphylococcal genomes. CONCLUSION: We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at .
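The definition in the abstract (a substring that occurs exactly once and cannot be shortened without losing uniqueness) can be illustrated with a small sketch. This is not the authors' shustring program, which is written in C and, according to the abstract, relies on generalized suffix trees; the brute-force k-mer counting below is a swapped-in technique that only suits short sequences, and the function name shortest_unique_substrings is hypothetical.

```python
from collections import Counter


def shortest_unique_substrings(seq):
    """Return (k, substrings): the smallest length k at which some substring
    of seq occurs exactly once, and the sorted list of those substrings.

    Brute-force illustration only (assumption: the input is short); the
    abstract describes suffix-tree-based detection instead.
    """
    for k in range(1, len(seq) + 1):
        # Count every substring of length k, including overlapping ones.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        unique = sorted(kmer for kmer, n in counts.items() if n == 1)
        if unique:
            # No shorter substring is unique, so these cannot be reduced
            # in length without losing uniqueness.
            return k, unique
    return 0, []  # empty input: no substrings at all


if __name__ == "__main__":
    print(shortest_unique_substrings("ACGTACGT"))
```

For a set of sequences, the same idea would apply by counting k-mers across all sequences at once, though that extension is not shown here.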
format Text
id pubmed-1166540
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1166540 2005-06-30 Genome comparison without alignment using shortest unique substrings Haubold, Bernhard Pierstorff, Nora Möller, Friedrich Wiehe, Thomas BMC Bioinformatics Research Article
BioMed Central 2005-05-23 /pmc/articles/PMC1166540/ /pubmed/15910684 http://dx.doi.org/10.1186/1471-2105-6-123 Text en Copyright © 2005 Haubold et al; licensee BioMed Central Ltd.
title Genome comparison without alignment using shortest unique substrings
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166540/
https://www.ncbi.nlm.nih.gov/pubmed/15910684
http://dx.doi.org/10.1186/1471-2105-6-123