Cargando…

Efficient computation of absent words in genomic sequences

BACKGROUND: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genet...

Descripción completa

Detalles Bibliográficos
Autores principales:	Herold, Julia, Kurtz, Stefan, Giegerich, Robert
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2008
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375138/ https://www.ncbi.nlm.nih.gov/pubmed/18366790 http://dx.doi.org/10.1186/1471-2105-9-167

_version_	1782154586472054784
author	Herold, Julia Kurtz, Stefan Giegerich, Robert
author_facet	Herold, Julia Kurtz, Stefan Giegerich, Robert
author_sort	Herold, Julia
collection	PubMed
description	BACKGROUND: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies. RESULTS: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10(9 )down to 10(5 )bp. CONCLUSION: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
format	Text
id	pubmed-2375138
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-23751382008-05-09 Efficient computation of absent words in genomic sequences Herold, Julia Kurtz, Stefan Giegerich, Robert BMC Bioinformatics Methodology Article BACKGROUND: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies. RESULTS: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10(9 )down to 10(5 )bp. CONCLUSION: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data. BioMed Central 2008-03-26 /pmc/articles/PMC2375138/ /pubmed/18366790 http://dx.doi.org/10.1186/1471-2105-9-167 Text en Copyright © 2008 Herold et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Herold, Julia Kurtz, Stefan Giegerich, Robert Efficient computation of absent words in genomic sequences
title	Efficient computation of absent words in genomic sequences
title_full	Efficient computation of absent words in genomic sequences
title_fullStr	Efficient computation of absent words in genomic sequences
title_full_unstemmed	Efficient computation of absent words in genomic sequences
title_short	Efficient computation of absent words in genomic sequences
title_sort	efficient computation of absent words in genomic sequences
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375138/ https://www.ncbi.nlm.nih.gov/pubmed/18366790 http://dx.doi.org/10.1186/1471-2105-9-167
work_keys_str_mv	AT heroldjulia efficientcomputationofabsentwordsingenomicsequences AT kurtzstefan efficientcomputationofabsentwordsingenomicsequences AT giegerichrobert efficientcomputationofabsentwordsingenomicsequences

Efficient computation of absent words in genomic sequences

Ejemplares similares