GemSIM: general, error-model based simulator of next-generation sequencing data

BACKGROUND: GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirical...

Bibliographic Details
Main authors: McElroy, Kerensa E; Luciani, Fabio; Thomas, Torsten
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305602/
https://www.ncbi.nlm.nih.gov/pubmed/22336055
http://dx.doi.org/10.1186/1471-2164-13-74
author McElroy, Kerensa E
Luciani, Fabio
Thomas, Torsten
collection PubMed
description BACKGROUND: GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects. RESULTS: We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing and contrasting the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than in Illumina, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs. CONCLUSIONS: Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
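The idea of an empirically derived, sequence-context based error model can be illustrated with a minimal Python sketch. This is not GemSIM's code or interface: the function names `derive_error_model` and `simulate_read` are hypothetical, the model covers substitutions only, and it omits the indels, quality scores, and fragment length distributions mentioned in the abstract. It assumes the input is a list of reads already aligned base-for-base to their reference segments.

```python
import random
from collections import defaultdict

BASES = "ACGT"


def derive_error_model(pairs):
    """Estimate substitution rates keyed by (read position, 2-base context).

    `pairs` is a list of (read, reference_segment) strings of equal length.
    The context is the previous reference base plus the current one;
    "^" marks the start of the read. (Hypothetical, simplified model.)
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, observations]
    for read, ref in pairs:
        for i, (observed, true) in enumerate(zip(read, ref)):
            prev = ref[i - 1] if i > 0 else "^"
            key = (i, prev + true)
            counts[key][1] += 1
            if observed != true:
                counts[key][0] += 1
    return {key: errors / total for key, (errors, total) in counts.items()}


def simulate_read(genome, start, length, model, rng, default_rate=0.0):
    """Copy a fragment of `genome` and inject substitutions using `model`.

    Contexts never seen while deriving the model fall back to `default_rate`.
    """
    fragment = genome[start:start + length]
    out = []
    for i, true in enumerate(fragment):
        prev = fragment[i - 1] if i > 0 else "^"
        rate = model.get((i, prev + true), default_rate)
        if rng.random() < rate:
            # Substitute with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != true]))
        else:
            out.append(true)
    return "".join(out)


if __name__ == "__main__":
    training = [("ACGT", "ACGT"), ("ACGA", "ACGT")]
    model = derive_error_model(training)
    rng = random.Random(42)  # seeded so runs are repeatable
    print(model)
    print(simulate_read("ACGTACGTACGT", 0, 4, model, rng))
```

Keying the rates on both position and local context is what lets a model of this kind reproduce effects such as the rise in error rates near the end of each read; a real simulator would need far more training data than the two toy reads used here.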
format Online
Article
Text
id pubmed-3305602
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3305602 2012-03-16 GemSIM: general, error-model based simulator of next-generation sequencing data McElroy, Kerensa E Luciani, Fabio Thomas, Torsten BMC Genomics Software BioMed Central 2012-02-15 /pmc/articles/PMC3305602/ /pubmed/22336055 http://dx.doi.org/10.1186/1471-2164-13-74 Text en Copyright ©2012 McElroy et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GemSIM: general, error-model based simulator of next-generation sequencing data
topic Software