
Enhancing genome assemblies by integrating non-sequence based data

INTRODUCTION: Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial am...


Bibliographic Details
Main Authors: Heider, Thomas N, Lindsay, James, Wang, Chenwei, O’Neill, Rachel J, Pask, Andrew J
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3090765/
https://www.ncbi.nlm.nih.gov/pubmed/21554765
http://dx.doi.org/10.1186/1753-6561-5-S2-S7
author Heider, Thomas N
Lindsay, James
Wang, Chenwei
O’Neill, Rachel J
Pask, Andrew J
collection PubMed
description INTRODUCTION: Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies. METHODS: The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low-coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. These data were used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold our libraries. To determine how map data compare with sequence-based approaches to enhancing assemblies, we repeated the experiment using 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined the sequence-based and non-sequence-based data to determine how a combined approach could further enhance the quality of the low-coverage de novo reconstruction of the tammar wallaby genome. RESULTS: Using the map data alone, we were able to order 2.2% of the initial contigs into scaffolds and increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence-based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets resulted in a further 2% increase in the number of initial contigs integrated into a scaffold (55% total) but a 35% increase in N50 scaffold size over the use of sequence-based data alone. CONCLUSIONS: We provide a relatively simple pipeline, available at http://www.mcb.uconn.edu/fac.php?name=paska, that uses existing bioinformatics tools to integrate map data into a genome assembly. While the map data contributed only minimally to assigning the initial contigs to scaffolds in the new assembly, they greatly increased the N50 size. This process added structure to our low-coverage assembly, greatly increasing its utility in further analyses.
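The abstract describes two ideas that a short sketch may make more concrete: turning mapped markers into pseudo paired-end links between contigs, and summarising an assembly by its N50. The Python below is only an illustration under stated assumptions and is not the authors' Bioperl pipeline. The `MarkerHit` record, the function names, and the reading of "5-10 MB" as a map distance of 5 to 10 million base pairs are all hypothetical, and no BLAT or Bambus file format is reproduced here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MarkerHit:
    """Hypothetical record: one map marker placed on one contig."""
    marker: str
    chromosome: str
    map_pos: int   # marker position on the map, in bp (assumed unit)
    contig: str    # contig the marker aligned to


def pseudo_pairs(hits, min_gap=5_000_000, max_gap=10_000_000):
    """Link contigs whose markers lie min_gap..max_gap apart on one chromosome.

    Each returned tuple (contig_a, contig_b, gap) plays the role of a
    paired-end link with an expected insert size of `gap`.
    """
    ordered = sorted(hits, key=lambda h: (h.chromosome, h.map_pos))
    pairs = []
    for a, b in combinations(ordered, 2):
        # Skip markers on different chromosomes or on the same contig.
        if a.chromosome != b.chromosome or a.contig == b.contig:
            continue
        gap = b.map_pos - a.map_pos  # non-negative because of the sort
        if min_gap <= gap <= max_gap:
            pairs.append((a.contig, b.contig, gap))
    return pairs


def n50(lengths):
    """Largest length L such that sequences of length >= L cover at
    least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    hits = [
        MarkerHit("m1", "chr1", 0, "contig_1"),
        MarkerHit("m2", "chr1", 6_000_000, "contig_2"),
        MarkerHit("m3", "chr1", 20_000_000, "contig_3"),
    ]
    print(pseudo_pairs(hits))
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends on the length distribution and not on the number of scaffolded contigs, a handful of long-range links that join already large scaffolds can raise N50 noticeably while placing few additional contigs, which is consistent with the pattern reported in the RESULTS.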
format Text
id pubmed-3090765
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3090765 2011-05-28 Enhancing genome assemblies by integrating non-sequence based data Heider, Thomas N Lindsay, James Wang, Chenwei O’Neill, Rachel J Pask, Andrew J BMC Proc Proceedings BioMed Central 2011-05-28 /pmc/articles/PMC3090765/ /pubmed/21554765 http://dx.doi.org/10.1186/1753-6561-5-S2-S7 Text en Copyright ©2011 Heider et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Enhancing genome assemblies by integrating non-sequence based data
topic Proceedings