
NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model

BACKGROUND: PacBio sequencing platform offers longer read lengths than the second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequen...

Bibliographic Details
Main Authors: Wei, Ze-Gang, Zhang, Shao-Wu
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964698/
https://www.ncbi.nlm.nih.gov/pubmed/29788930
http://dx.doi.org/10.1186/s12859-018-2208-0
_version_ 1783325229695631360
author Wei, Ze-Gang
Zhang, Shao-Wu
author_facet Wei, Ze-Gang
Zhang, Shao-Wu
author_sort Wei, Ze-Gang
collection PubMed
description BACKGROUND: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of subsequent analysis tools. Although there are several available simulators (e.g., PBSIM, SimLoRD and FASTQSim) that target the specific generation of PacBio libraries, the error rate of simulated sequences is not well matched to the quality value of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). RESULTS: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator (called NPBSS) for producing CLR reads. The NPBSS simulator first samples the read sequences according to a log-normal distribution of read lengths, and chooses different base quality values with different proportions. Then, NPBSS computes the overall error probability of each base in the read sequence with an empirical model, and calculates the deletion, substitution and insertion probabilities from the overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of the PacBio CLR reads better than PBSIM and FASTQSim. In addition, the assembly results also show that simulated sequences of NPBSS are more like real PacBio CLR data. CONCLUSION: The NPBSS simulator is convenient to use, with efficient computation and flexible parameter settings. The PacBio CLR reads it generates are more like real PacBio datasets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2208-0) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5964698
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-59646982018-05-24 NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model Wei, Ze-Gang Zhang, Shao-Wu BMC Bioinformatics Software BACKGROUND: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of subsequent analysis tools. Although there are several available simulators (e.g., PBSIM, SimLoRD and FASTQSim) that target the specific generation of PacBio libraries, the error rate of simulated sequences is not well matched to the quality value of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). RESULTS: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator (called NPBSS) for producing CLR reads. The NPBSS simulator first samples the read sequences according to a log-normal distribution of read lengths, and chooses different base quality values with different proportions. Then, NPBSS computes the overall error probability of each base in the read sequence with an empirical model, and calculates the deletion, substitution and insertion probabilities from the overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of the PacBio CLR reads better than PBSIM and FASTQSim. In addition, the assembly results also show that simulated sequences of NPBSS are more like real PacBio CLR data. CONCLUSION: The NPBSS simulator is convenient to use, with efficient computation and flexible parameter settings. The PacBio CLR reads it generates are more like real PacBio datasets.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2208-0) contains supplementary material, which is available to authorized users. BioMed Central 2018-05-22 /pmc/articles/PMC5964698/ /pubmed/29788930 http://dx.doi.org/10.1186/s12859-018-2208-0 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Wei, Ze-Gang
Zhang, Shao-Wu
NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title_full NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title_fullStr NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title_full_unstemmed NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title_short NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
title_sort npbss: a new pacbio sequencing simulator for generating the continuous long reads with an empirical model
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964698/
https://www.ncbi.nlm.nih.gov/pubmed/29788930
http://dx.doi.org/10.1186/s12859-018-2208-0
work_keys_str_mv AT weizegang npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel
AT zhangshaowu npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel
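
The abstract in the description field outlines the simulator as a sequence of steps: draw a read length from a log-normal distribution, pick base quality values in fixed proportions, turn each quality value into an overall error probability, and split that probability into deletion, substitution and insertion events. The Python sketch below illustrates that flow only; it is not the NPBSS implementation. Every name and number in it is an assumption made for illustration: the functions `phred_to_error` and `simulate_read`, the log-normal parameters, the quality values and their weights, and the error-type shares. In particular, the standard Phred relation is used as a stand-in for the paper's empirical model, which the abstract does not specify.

```python
import random

BASES = "ACGT"


def phred_to_error(q):
    """Standard Phred relation; a stand-in for the paper's empirical model."""
    return 10 ** (-q / 10)


def simulate_read(reference, rng, mu=6.0, sigma=0.4,
                  quals=(6, 10, 14), qual_weights=(0.3, 0.4, 0.3),
                  split=(0.3, 0.1, 0.6)):
    """Return (read, qualities) for one simulated read.

    All defaults are hypothetical. `split` holds the assumed shares of
    (deletion, substitution, insertion) within the overall per-base
    error probability and should sum to 1.
    """
    if not reference:
        raise ValueError("reference must not be empty")

    # Step 1: read length from a log-normal distribution, capped at the
    # reference length, then a uniformly chosen start position.
    length = max(1, min(len(reference), int(rng.lognormvariate(mu, sigma))))
    start = rng.randrange(len(reference) - length + 1)
    template = reference[start:start + length]

    read, read_quals = [], []
    for base in template:
        # Step 2: choose a quality value according to the given proportions.
        q = rng.choices(quals, weights=qual_weights)[0]
        # Step 3: overall error probability, split by error type.
        p = phred_to_error(q)
        p_del, p_sub, _p_ins = (p * share for share in split)
        r = rng.random()
        if r < p_del:
            continue  # deletion: the template base is skipped
        if r < p_del + p_sub:
            # substitution: replace with a different base
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
        read_quals.append(q)
        if p_del + p_sub <= r < p:
            # insertion: emit one extra random base after the current one
            read.append(rng.choice(BASES))
            read_quals.append(q)
    return "".join(read), read_quals
```

A call such as `simulate_read("ACGT" * 500, random.Random(0))` returns one read and its per-base quality values. Passing a seeded `random.Random` instance keeps runs reproducible, which is useful when comparing simulated reads across parameter settings.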