Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data

Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with...

Bibliographic Details
Main authors: Mende, Daniel R., Waller, Alison S., Sunagawa, Shinichi, Järvelin, Aino I., Chan, Michelle M., Arumugam, Manimozhiyan, Raes, Jeroen, Bork, Peer
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3285633/
https://www.ncbi.nlm.nih.gov/pubmed/22384016
http://dx.doi.org/10.1371/journal.pone.0031386
author Mende, Daniel R.
Waller, Alison S.
Sunagawa, Shinichi
Järvelin, Aino I.
Chan, Michelle M.
Arumugam, Manimozhiyan
Raes, Jeroen
Bork, Peer
collection PubMed
description Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
format Online
Article
Text
id pubmed-3285633
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3285633 2012-03-01 Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data Mende, Daniel R. Waller, Alison S. Sunagawa, Shinichi Järvelin, Aino I. Chan, Michelle M. Arumugam, Manimozhiyan Raes, Jeroen Bork, Peer PLoS One Research Article Public Library of Science 2012-02-23 /pmc/articles/PMC3285633/ /pubmed/22384016 http://dx.doi.org/10.1371/journal.pone.0031386 Text en Mende et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3285633/
https://www.ncbi.nlm.nih.gov/pubmed/22384016
http://dx.doi.org/10.1371/journal.pone.0031386