A better sequence-read simulator program for metagenomics

BACKGROUND: There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow...

Bibliographic Details
Main Authors:	Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168713/ https://www.ncbi.nlm.nih.gov/pubmed/25253095 http://dx.doi.org/10.1186/1471-2105-15-S9-S14

_version_	1782335604606894080
author	Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony
author_facet	Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony
author_sort	Johnson, Stephen
collection	PubMed
description	BACKGROUND: There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. RESULTS: We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. CONCLUSIONS: BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.
format	Online Article Text
id	pubmed-4168713
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4168713 2014-10-02 A better sequence-read simulator program for metagenomics Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony BMC Bioinformatics Proceedings BioMed Central 2014-09-10 /pmc/articles/PMC4168713/ /pubmed/25253095 http://dx.doi.org/10.1186/1471-2105-15-S9-S14 Text en Copyright © 2014 Johnson et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Johnson, Stephen Trost, Brett Long, Jeffrey R Pittet, Vanessa Kusalik, Anthony A better sequence-read simulator program for metagenomics
title	A better sequence-read simulator program for metagenomics
title_full	A better sequence-read simulator program for metagenomics
title_fullStr	A better sequence-read simulator program for metagenomics
title_full_unstemmed	A better sequence-read simulator program for metagenomics
title_short	A better sequence-read simulator program for metagenomics
title_sort	better sequence-read simulator program for metagenomics
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168713/ https://www.ncbi.nlm.nih.gov/pubmed/25253095 http://dx.doi.org/10.1186/1471-2105-15-S9-S14
work_keys_str_mv	AT johnsonstephen abettersequencereadsimulatorprogramformetagenomics AT trostbrett abettersequencereadsimulatorprogramformetagenomics AT longjeffreyr abettersequencereadsimulatorprogramformetagenomics AT pittetvanessa abettersequencereadsimulatorprogramformetagenomics AT kusalikanthony abettersequencereadsimulatorprogramformetagenomics AT johnsonstephen bettersequencereadsimulatorprogramformetagenomics AT trostbrett bettersequencereadsimulatorprogramformetagenomics AT longjeffreyr bettersequencereadsimulatorprogramformetagenomics AT pittetvanessa bettersequencereadsimulatorprogramformetagenomics AT kusalikanthony bettersequencereadsimulatorprogramformetagenomics
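
Illustrative note: the abstract contrasts simulators that assume uniform or normal read lengths with ones that follow a non-parametric, empirically derived distribution. The Python sketch below shows that general idea only. It is not BEAR's implementation, which the record does not describe, and the function name simulate_reads, its parameters, and the error-free read model are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_reads(genome, observed_lengths, n_reads, seed=0):
    """Draw error-free reads whose lengths follow the observed distribution.

    Hypothetical helper for illustration; not part of BEAR.
    """
    rng = random.Random(seed)

    # Build the empirical (non-parametric) distribution: each distinct
    # observed length is weighted by how often it occurred.
    counts = Counter(observed_lengths)
    lengths = sorted(counts)
    weights = [counts[n] for n in lengths]

    reads = []
    for length in rng.choices(lengths, weights=weights, k=n_reads):
        # Clamp so a sampled length never exceeds the reference sequence.
        length = min(length, len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads
```

Because the lengths are resampled from observed values, no distribution family has to be chosen in advance. A fuller simulator would also need quality values, sequencing errors, and per-genome abundances, none of which this sketch attempts.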
