FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets

BACKGROUND: High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algo...

Bibliographic Details
Main author:	Shcherbina, Anna
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246604/ https://www.ncbi.nlm.nih.gov/pubmed/25123167 http://dx.doi.org/10.1186/1756-0500-7-533

author	Shcherbina, Anna
collection	PubMed
description	BACKGROUND: High-throughput next-generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples, by contrast, provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible. RESULTS: FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step. CONCLUSIONS: FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform-independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1756-0500-7-533) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4246604
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4246604 2014-11-29 FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets Shcherbina, Anna BMC Res Notes Research Article BioMed Central 2014-08-15 /pmc/articles/PMC4246604/ /pubmed/25123167 http://dx.doi.org/10.1186/1756-0500-7-533 Text en © Shcherbina; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246604/ https://www.ncbi.nlm.nih.gov/pubmed/25123167 http://dx.doi.org/10.1186/1756-0500-7-533
