NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
BACKGROUND: PacBio sequencing platform offers longer read lengths than the second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequen...
Main authors: | Wei, Ze-Gang, Zhang, Shao-Wu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964698/ https://www.ncbi.nlm.nih.gov/pubmed/29788930 http://dx.doi.org/10.1186/s12859-018-2208-0 |
_version_ | 1783325229695631360 |
---|---|
author | Wei, Ze-Gang Zhang, Shao-Wu |
author_facet | Wei, Ze-Gang Zhang, Shao-Wu |
author_sort | Wei, Ze-Gang |
collection | PubMed |
description | BACKGROUND: PacBio sequencing platform offers longer read lengths than the second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of subsequent analysis tools. Although there are several available simulators (e.g., PBSIM, SimLoRD and FASTQSim) that target the specific generation of PacBio libraries, the error rate of simulated sequences is not well matched to the quality value of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). RESULTS: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator (called NPBSS) for producing CLR reads. NPBSS first samples the read sequences according to a log-normal distribution of read lengths, and chooses different base quality values with different proportions. Then, NPBSS computes the overall error probability of each base in the read sequence with an empirical model, and derives the deletion, substitution and insertion probabilities from the overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of the PacBio CLR reads better than PBSIM and FASTQSim. In addition, the assembly results show that sequences simulated by NPBSS are more like real PacBio CLR data. CONCLUSION: NPBSS is convenient to use, with efficient computation and flexible parameter settings. The PacBio CLR reads it generates are more like real PacBio datasets. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2208-0) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5964698 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5964698 2018-05-24 NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model Wei, Ze-Gang Zhang, Shao-Wu BMC Bioinformatics Software BACKGROUND: PacBio sequencing platform offers longer read lengths than the second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fast sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of subsequent analysis tools. Although there are several available simulators (e.g., PBSIM, SimLoRD and FASTQSim) that target the specific generation of PacBio libraries, the error rate of simulated sequences is not well matched to the quality value of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). RESULTS: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator (called NPBSS) for producing CLR reads. NPBSS first samples the read sequences according to a log-normal distribution of read lengths, and chooses different base quality values with different proportions. Then, NPBSS computes the overall error probability of each base in the read sequence with an empirical model, and derives the deletion, substitution and insertion probabilities from the overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of the PacBio CLR reads better than PBSIM and FASTQSim. In addition, the assembly results show that sequences simulated by NPBSS are more like real PacBio CLR data. CONCLUSION: NPBSS is convenient to use, with efficient computation and flexible parameter settings. The PacBio CLR reads it generates are more like real PacBio datasets. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2208-0) contains supplementary material, which is available to authorized users. BioMed Central 2018-05-22 /pmc/articles/PMC5964698/ /pubmed/29788930 http://dx.doi.org/10.1186/s12859-018-2208-0 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Wei, Ze-Gang Zhang, Shao-Wu NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title | NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_full | NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_fullStr | NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_full_unstemmed | NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_short | NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_sort | npbss: a new pacbio sequencing simulator for generating the continuous long reads with an empirical model |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964698/ https://www.ncbi.nlm.nih.gov/pubmed/29788930 http://dx.doi.org/10.1186/s12859-018-2208-0 |
work_keys_str_mv | AT weizegang npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel AT zhangshaowu npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel |
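
The simulation steps summarized in the description field (sample a read length from a log-normal distribution, pick base quality values by proportion, turn each quality value into an overall error probability, then split that probability among deletions, substitutions and insertions) might be sketched as follows. This is a minimal illustration, not the NPBSS implementation: every function name and parameter value here is hypothetical, and because the record does not give the paper's empirical model, the standard Phred relation p = 10^(-Q/10) is used as a stand-in for it.

```python
import random

# Hypothetical parameters for illustration; none of these values come from the paper.
LOG_MEAN, LOG_SIGMA = 8.5, 0.6                         # log-normal read-length parameters
QV_PROPORTIONS = {5: 0.2, 8: 0.3, 10: 0.3, 13: 0.2}    # quality value -> proportion
ERROR_SPLIT = {"del": 0.3, "sub": 0.1, "ins": 0.6}     # share of each error type


def sample_read_length(rng):
    """Draw a read length (at least 1) from a log-normal distribution."""
    return max(1, int(rng.lognormvariate(LOG_MEAN, LOG_SIGMA)))


def sample_quality(rng):
    """Pick one base quality value according to the given proportions."""
    values = list(QV_PROPORTIONS)
    weights = list(QV_PROPORTIONS.values())
    return rng.choices(values, weights=weights, k=1)[0]


def error_probability(qv):
    """Standard Phred relation, used here in place of the empirical model."""
    return 10 ** (-qv / 10)


def simulate_read(reference, rng):
    """Copy a random window of the reference, injecting errors base by base."""
    length = min(sample_read_length(rng), len(reference))
    start = rng.randrange(len(reference) - length + 1)
    read = []
    for base in reference[start:start + length]:
        p = error_probability(sample_quality(rng))
        r = rng.random()
        if r < p * ERROR_SPLIT["del"]:
            continue                                    # deletion: drop the base
        elif r < p * (ERROR_SPLIT["del"] + ERROR_SPLIT["sub"]):
            # substitution: replace with a different base
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < p:
            read.append(base)                           # insertion: keep the base,
            read.append(rng.choice("ACGT"))             # then add a random one
        else:
            read.append(base)                           # no error
    return "".join(read)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "ACGT" * 5000
    print(len(simulate_read(reference, rng)))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is useful when comparing simulated reads across parameter settings.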