Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots
Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different fro...
Main authors: | Kumar, Sujai; Jones, Martin; Koutsovoulos, Georgios; Clarke, Michael; Blaxter, Mark |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A., 2013 |
Subjects: | Genetics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/ https://www.ncbi.nlm.nih.gov/pubmed/24348509 http://dx.doi.org/10.3389/fgene.2013.00237 |
_version_ | 1782293053735698432 |
---|---|
author | Kumar, Sujai Jones, Martin Koutsovoulos, Georgios Clarke, Michael Blaxter, Mark |
author_facet | Kumar, Sujai Jones, Martin Koutsovoulos, Georgios Clarke, Michael Blaxter, Mark |
author_sort | Kumar, Sujai |
collection | PubMed |
description | Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer. |
format | Online Article Text |
id | pubmed-3843372 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-3843372 2013-12-13 Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots Kumar, Sujai Jones, Martin Koutsovoulos, Georgios Clarke, Michael Blaxter, Mark Front Genet Genetics Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer. Frontiers Media S.A. 2013-11-29 /pmc/articles/PMC3843372/ /pubmed/24348509 http://dx.doi.org/10.3389/fgene.2013.00237 Text en Copyright © 2013 Kumar, Jones, Koutsovoulos, Clarke and Blaxter. http://creativecommons.org/licenses/by/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Genetics Kumar, Sujai Jones, Martin Koutsovoulos, Georgios Clarke, Michael Blaxter, Mark Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title | Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title_full | Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title_fullStr | Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title_full_unstemmed | Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title_short | Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots |
title_sort | blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated gc-coverage plots |
topic | Genetics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843372/ https://www.ncbi.nlm.nih.gov/pubmed/24348509 http://dx.doi.org/10.3389/fgene.2013.00237 |
work_keys_str_mv | AT kumarsujai blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots AT jonesmartin blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots AT koutsovoulosgeorgios blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots AT clarkemichael blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots AT blaxtermark blobologyexploringrawgenomedataforcontaminantssymbiontsandparasitesusingtaxonannotatedgccoverageplots |
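
The description in this record says that draft contigs are partitioned using three indicators: proportion of GC bases, read coverage, and best-matching taxon. The Python sketch below illustrates that general idea only. It is not the Blobology pipeline script, and every name in it (`Contig`, `gc_fraction`, `tagc_rows`, `select_bin`) is hypothetical. It assumes that per-contig coverage and taxon labels have already been computed elsewhere, for example from a read mapping and a database search.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class Contig:
    """Hypothetical container for one draft-assembly contig."""
    name: str
    sequence: str
    coverage: float  # mean read depth; assumed to be precomputed
    taxon: str       # best-hit taxon label; assumed to be precomputed


def gc_fraction(sequence: str) -> float:
    """Proportion of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


Row = Tuple[str, float, float, str]


def tagc_rows(contigs: Iterable[Contig]) -> List[Row]:
    """One (name, GC fraction, coverage, taxon) row per contig.

    These are the values that would be plotted as GC (x axis) against
    coverage (y axis), coloured by taxon.
    """
    return [(c.name, gc_fraction(c.sequence), c.coverage, c.taxon)
            for c in contigs]


def select_bin(rows: Iterable[Row],
               gc_range: Tuple[float, float],
               min_coverage: float,
               taxa: Optional[Set[str]] = None) -> List[str]:
    """Names of contigs inside a GC window and above a coverage floor.

    If `taxa` is given, only contigs whose taxon label is in that set
    are kept. The thresholds are illustrative choices made by the user
    after inspecting the plot, not values taken from the paper.
    """
    low, high = gc_range
    return [name for name, gc, cov, taxon in rows
            if low <= gc <= high
            and cov >= min_coverage
            and (taxa is None or taxon in taxa)]
```

In use, one would build the rows once, inspect them as a scatter plot, and then call `select_bin` with a GC window and coverage floor that isolate one visible cluster. The reads mapping to the selected contigs could then be reassembled separately.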