Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots

Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different fro...

Bibliographic Details
Main Authors: Kumar, Sujai; Jones, Martin; Koutsovoulos, Georgios; Clarke, Michael; Blaxter, Mark
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2013
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/
https://www.ncbi.nlm.nih.gov/pubmed/24348509
http://dx.doi.org/10.3389/fgene.2013.00237
_version_ 1782293053735698432
author Kumar, Sujai
Jones, Martin
Koutsovoulos, Georgios
Clarke, Michael
Blaxter, Mark
author_facet Kumar, Sujai
Jones, Martin
Koutsovoulos, Georgios
Clarke, Michael
Blaxter, Mark
author_sort Kumar, Sujai
collection PubMed
description Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
format Online
Article
Text
id pubmed-3843372
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-38433722013-12-13 Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots Kumar, Sujai Jones, Martin Koutsovoulos, Georgios Clarke, Michael Blaxter, Mark Front Genet Genetics Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. 
Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer. Frontiers Media S.A. 2013-11-29 /pmc/articles/PMC3843372/ /pubmed/24348509 http://dx.doi.org/10.3389/fgene.2013.00237 Text en Copyright © 2013 Kumar, Jones, Koutsovoulos, Clarke and Blaxter. http://creativecommons.org/licenses/by/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Genetics
Kumar, Sujai
Jones, Martin
Koutsovoulos, Georgios
Clarke, Michael
Blaxter, Mark
Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title_full Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title_fullStr Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title_full_unstemmed Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title_short Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
title_sort blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated gc-coverage plots
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/
https://www.ncbi.nlm.nih.gov/pubmed/24348509
http://dx.doi.org/10.3389/fgene.2013.00237
work_keys_str_mv AT kumarsujai blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots
AT jonesmartin blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots
AT koutsovoulosgeorgios blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots
AT clarkemichael blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots
AT blaxtermark blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots
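
The description in this record says that contigs are partitioned using the proportion of GC bases, read coverage, and a best-matching taxon. The following is a minimal sketch of that idea, not the Blobology pipeline itself. The `Contig` fields, the function names, the example sequences, and the selection thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    mapped_read_bases: int  # hypothetical input: total read bases aligned to this contig
    taxon: str              # hypothetical input: best-matching taxon from a database search


def gc_fraction(sequence: str) -> float:
    """Proportion of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def coverage(contig: Contig) -> float:
    """Mean read coverage: aligned read bases divided by contig length."""
    length = len(contig.sequence)
    return contig.mapped_read_bases / length if length else 0.0


def tagc_rows(contigs):
    """One (name, gc, coverage, taxon) row per contig: the points of a TAGC plot."""
    return [(c.name, gc_fraction(c.sequence), coverage(c), c.taxon) for c in contigs]


def select_bin(rows, gc_range, min_cov, max_cov):
    """Names of contigs whose GC and coverage fall inside a rectangular region."""
    lo, hi = gc_range
    return [name for name, gc, cov, _ in rows
            if lo <= gc <= hi and min_cov <= cov <= max_cov]


# Toy data: two contigs with different GC content and coverage.
contigs = [
    Contig("contig_1", "ATATATGCAT", 500, "Nematoda"),
    Contig("contig_2", "GCGCGCGCAT", 50, "Proteobacteria"),
]
rows = tagc_rows(contigs)
# Thresholds are arbitrary; in practice they would be read off the plot.
host_like = select_bin(rows, gc_range=(0.0, 0.5), min_cov=20, max_cov=100)
```

In a real project the coverage and taxon values would come from read alignment and database searches. The rectangular selection stands in for the interactive subset selection that the description attributes to Blobsplorer.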