SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles

BACKGROUND: A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. Particularly, the positional and genomic contex...

Bibliographic Details
Main authors:	Yu, Zhenhua; Du, Fang; Ban, Rongjun; Zhang, Yuanwei
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7379788/ https://www.ncbi.nlm.nih.gov/pubmed/32703148 http://dx.doi.org/10.1186/s12859-020-03665-5

author	Yu, Zhenhua Du, Fang Ban, Rongjun Zhang, Yuanwei
collection	PubMed
description	BACKGROUND: A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively utilized for reliably characterizing base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. Thus, a more effective and efficient bioinformatics tool is sorely needed. RESULTS: Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are fully explored and integrated into the simulation model for reliably emulating datasets for different applications. In addition, an integrated and easy-to-use pipeline is employed in SimuSCoP to facilitate end-to-end simulation of complex samples, and high runtime efficiency is achieved by implementing the tool with multithreading and low memory consumption. These features enable SimuSCoP to achieve substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is comprehensively evaluated in multiple aspects, including consistency of profiles and simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools. CONCLUSIONS: SimuSCoP, a new bioinformatics tool, was developed to learn informative profiles from real sequencing data and reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse new development of downstream bioinformatics methods for analyzing sequencing data.
format	Online Article Text
id	pubmed-7379788
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7379788 2020-08-04 SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles Yu, Zhenhua Du, Fang Ban, Rongjun Zhang, Yuanwei BMC Bioinformatics Software BioMed Central 2020-07-23 /pmc/articles/PMC7379788/ /pubmed/32703148 http://dx.doi.org/10.1186/s12859-020-03665-5 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7379788/ https://www.ncbi.nlm.nih.gov/pubmed/32703148 http://dx.doi.org/10.1186/s12859-020-03665-5

Similar Items