
Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first id...


Bibliographic Details
Main authors:	Popendorf, Kris, Tsuyoshi, Hachiya, Osana, Yasunori, Sakakibara, Yasubumi
Format:	Text
Language:	English
Published:	Public Library of Science 2010
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945767/ https://www.ncbi.nlm.nih.gov/pubmed/20885980 http://dx.doi.org/10.1371/journal.pone.0012651

collection	PubMed
description	BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
id	pubmed-2945767
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	PLoS One
published	2010-09-24
rights	Popendorf et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
