Enhancing genome assemblies by integrating non-sequence based data

INTRODUCTION: Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial am...

Bibliographic Details
Main Authors:	Heider, Thomas N; Lindsay, James; Wang, Chenwei; O’Neill, Rachel J; Pask, Andrew J
Format:	Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3090765/ https://www.ncbi.nlm.nih.gov/pubmed/21554765 http://dx.doi.org/10.1186/1753-6561-5-S2-S7

_version_	1782203176368209920
author	Heider, Thomas N Lindsay, James Wang, Chenwei O’Neill, Rachel J Pask, Andrew J
author_sort	Heider, Thomas N
collection	PubMed
description	INTRODUCTION: Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies. METHODS: The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low-coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. These data were used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold our libraries. To determine how map data compare with sequence-based approaches for enhancing assemblies, we repeated the experiment using 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined the sequence-based and non-sequence-based data to determine how a combined approach could further enhance the quality of the low-coverage de novo reconstruction of the tammar wallaby genome. RESULTS: Using the map data alone, we were able to order 2.2% of the initial contigs into scaffolds and to increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence-based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets raised the share of initial contigs integrated into a scaffold by only a further 2 percentage points (55% total) but increased the N50 scaffold size by 35% relative to the use of sequence-based data alone. CONCLUSIONS: We provide a relatively simple pipeline, available at http://www.mcb.uconn.edu/fac.php?name=paska, that uses existing bioinformatics tools to integrate map data into a genome assembly. While the map data contributed only minimally to assigning the initial contigs to scaffolds in the new assembly, they greatly increased the N50 size. This process added structure to our low-coverage assembly, greatly increasing its utility in further analyses.
format	Text
id	pubmed-3090765
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3090765 2011-05-28 Enhancing genome assemblies by integrating non-sequence based data Heider, Thomas N Lindsay, James Wang, Chenwei O’Neill, Rachel J Pask, Andrew J BMC Proc Proceedings BioMed Central 2011-05-28 /pmc/articles/PMC3090765/ /pubmed/21554765 http://dx.doi.org/10.1186/1753-6561-5-S2-S7 Text en Copyright ©2011 Heider et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Heider, Thomas N Lindsay, James Wang, Chenwei O’Neill, Rachel J Pask, Andrew J Enhancing genome assemblies by integrating non-sequence based data
title	Enhancing genome assemblies by integrating non-sequence based data
title_sort	enhancing genome assemblies by integrating non-sequence based data
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3090765/ https://www.ncbi.nlm.nih.gov/pubmed/21554765 http://dx.doi.org/10.1186/1753-6561-5-S2-S7
work_keys_str_mv	AT heiderthomasn enhancinggenomeassembliesbyintegratingnonsequencebaseddata AT lindsayjames enhancinggenomeassembliesbyintegratingnonsequencebaseddata AT wangchenwei enhancinggenomeassembliesbyintegratingnonsequencebaseddata AT oneillrachelj enhancinggenomeassembliesbyintegratingnonsequencebaseddata AT paskandrewj enhancinggenomeassembliesbyintegratingnonsequencebaseddata
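
The results in the description field are reported in terms of N50 scaffold size. As a reminder of what that statistic measures, here is a minimal Python sketch of the usual N50 definition: the length L such that sequences of length L or longer contain at least half of all assembled bases. The function name `n50` and the scaffold lengths are hypothetical illustrations; they are not taken from the paper or from its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Hypothetical scaffold lengths in bases (illustrative only, not paper data).
before = [50_000, 40_000, 30_000, 20_000, 10_000]

# The same bases after linking the 30 kb and 20 kb pieces into one scaffold.
after = [50_000, 50_000, 40_000, 10_000]

print(n50(before))  # 40000
print(n50(after))   # 50000
```

In this toy case the total number of bases is unchanged and only two pieces are joined, yet N50 rises from 40,000 to 50,000. That is consistent with the pattern the abstract describes, where linking a small fraction of contigs can still shift N50 noticeably.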