Cargando…

A genome alignment algorithm based on compression

BACKGROUND: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, t...

Descripción completa

Detalles Bibliográficos
Autores principales: Cao, Minh Duc, Dix, Trevor I, Allison, Lloyd
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2010
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3022628/
https://www.ncbi.nlm.nih.gov/pubmed/21159205
http://dx.doi.org/10.1186/1471-2105-11-599
_version_ 1782196536674877440
author Cao, Minh Duc
Dix, Trevor I
Allison, Lloyd
author_facet Cao, Minh Duc
Dix, Trevor I
Allison, Lloyd
author_sort Cao, Minh Duc
collection PubMed
description BACKGROUND: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique. RESULTS: Experiments on both simulated data and real data show that XMAligner is superior to conventional methods especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing purpose. CONCLUSIONS: The information-theoretic approach for sequence alignment is shown to overcome the mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment. AVAILABILITY: Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.
format Text
id pubmed-3022628
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-30226282011-01-20 A genome alignment algorithm based on compression Cao, Minh Duc Dix, Trevor I Allison, Lloyd BMC Bioinformatics Methodology Article BACKGROUND: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique. RESULTS: Experiments on both simulated data and real data show that XMAligner is superior to conventional methods especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing purpose. CONCLUSIONS: The information-theoretic approach for sequence alignment is shown to overcome the mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment. AVAILABILITY: Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/. BioMed Central 2010-12-16 /pmc/articles/PMC3022628/ /pubmed/21159205 http://dx.doi.org/10.1186/1471-2105-11-599 Text en Copyright ©2010 Cao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Cao, Minh Duc
Dix, Trevor I
Allison, Lloyd
A genome alignment algorithm based on compression
title A genome alignment algorithm based on compression
title_full A genome alignment algorithm based on compression
title_fullStr A genome alignment algorithm based on compression
title_full_unstemmed A genome alignment algorithm based on compression
title_short A genome alignment algorithm based on compression
title_sort genome alignment algorithm based on compression
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3022628/
https://www.ncbi.nlm.nih.gov/pubmed/21159205
http://dx.doi.org/10.1186/1471-2105-11-599
work_keys_str_mv AT caominhduc agenomealignmentalgorithmbasedoncompression
AT dixtrevori agenomealignmentalgorithmbasedoncompression
AT allisonlloyd agenomealignmentalgorithmbasedoncompression
AT caominhduc genomealignmentalgorithmbasedoncompression
AT dixtrevori genomealignmentalgorithmbasedoncompression
AT allisonlloyd genomealignmentalgorithmbasedoncompression