
Graph mining for next generation sequencing: leveraging the assembly graph for biological insights

BACKGROUND: The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principal objective of current as...


Bibliographic Details
Main authors: Warnke-Sommer, Julia, Ali, Hesham
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4859950/
https://www.ncbi.nlm.nih.gov/pubmed/27154001
http://dx.doi.org/10.1186/s12864-016-2678-2
author Warnke-Sommer, Julia
Ali, Hesham
author_facet Warnke-Sommer, Julia
Ali, Hesham
author_sort Warnke-Sommer, Julia
collection PubMed
description BACKGROUND: The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principal objective of current assembly tools is to assemble NGS reads into contiguous stretches of sequence called contigs while maximizing for both accuracy and contig length. The end goal of this process is to produce longer contigs with the major focus being on assembly only. Sequence read assembly is an aggregative process, during which read overlap relationship information is lost as reads are merged into longer sequences or contigs. The assembly graph is information-rich and capable of capturing the genomic architecture of an input read data set. We have developed a novel hybrid graph in which nodes represent sequence regions at different levels of granularity. This model, utilized in the assembly and analysis pipeline Focus, presents a concise yet feature-rich view of a given input data set, allowing for the extraction of biologically relevant graph structures for graph mining purposes. RESULTS: Focus was used to create hybrid graphs to model metagenomics data sets obtained from the gut microbiomes of five individuals with Crohn’s disease and eight healthy individuals. Repetitive and mobile genetic elements are found to be associated with hybrid graph structure. Using graph mining techniques, a comparative study of the Crohn’s disease and healthy data sets was conducted with focus on antibiotic resistance genes associated with transposase genes. Results demonstrated significant differences in the phylogenetic distribution of categories of antibiotic resistance genes in the healthy and diseased patients. Focus was also evaluated as a pure assembly tool and produced excellent results when compared against the Meta-velvet, Omega, and UD-IDBA assemblers.
CONCLUSIONS: Mining the hybrid graph can reveal biological phenomena captured by its structure. We demonstrate the advantages of considering assembly graphs as data-mining support in addition to their role as frameworks for assembly. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-2678-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4859950
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4859950 2016-05-08 Graph mining for next generation sequencing: leveraging the assembly graph for biological insights Warnke-Sommer, Julia Ali, Hesham BMC Genomics Research Article BACKGROUND: The assembly of Next Generation Sequencing (NGS) reads remains a challenging task. This is especially true for the assembly of metagenomics data that originate from environmental samples potentially containing hundreds to thousands of unique species. The principal objective of current assembly tools is to assemble NGS reads into contiguous stretches of sequence called contigs while maximizing for both accuracy and contig length. The end goal of this process is to produce longer contigs with the major focus being on assembly only. Sequence read assembly is an aggregative process, during which read overlap relationship information is lost as reads are merged into longer sequences or contigs. The assembly graph is information-rich and capable of capturing the genomic architecture of an input read data set. We have developed a novel hybrid graph in which nodes represent sequence regions at different levels of granularity. This model, utilized in the assembly and analysis pipeline Focus, presents a concise yet feature-rich view of a given input data set, allowing for the extraction of biologically relevant graph structures for graph mining purposes. RESULTS: Focus was used to create hybrid graphs to model metagenomics data sets obtained from the gut microbiomes of five individuals with Crohn’s disease and eight healthy individuals. Repetitive and mobile genetic elements are found to be associated with hybrid graph structure. Using graph mining techniques, a comparative study of the Crohn’s disease and healthy data sets was conducted with focus on antibiotic resistance genes associated with transposase genes.
Results demonstrated significant differences in the phylogenetic distribution of categories of antibiotic resistance genes in the healthy and diseased patients. Focus was also evaluated as a pure assembly tool and produced excellent results when compared against the Meta-velvet, Omega, and UD-IDBA assemblers. CONCLUSIONS: Mining the hybrid graph can reveal biological phenomena captured by its structure. We demonstrate the advantages of considering assembly graphs as data-mining support in addition to their role as frameworks for assembly. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-2678-2) contains supplementary material, which is available to authorized users. BioMed Central 2016-05-06 /pmc/articles/PMC4859950/ /pubmed/27154001 http://dx.doi.org/10.1186/s12864-016-2678-2 Text en © Warnke-Sommer and Ali. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Warnke-Sommer, Julia
Ali, Hesham
Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title_full Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title_fullStr Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title_full_unstemmed Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title_short Graph mining for next generation sequencing: leveraging the assembly graph for biological insights
title_sort graph mining for next generation sequencing: leveraging the assembly graph for biological insights
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4859950/
https://www.ncbi.nlm.nih.gov/pubmed/27154001
http://dx.doi.org/10.1186/s12864-016-2678-2
work_keys_str_mv AT warnkesommerjulia graphminingfornextgenerationsequencingleveragingtheassemblygraphforbiologicalinsights
AT alihesham graphminingfornextgenerationsequencingleveragingtheassemblygraphforbiologicalinsights