A hybrid correcting method considering heterozygous variations by a comprehensive probabilistic model

BACKGROUND: The emergence of third-generation sequencing technology, featuring longer read lengths, marks a great advance over next-generation sequencing technology and has greatly promoted biological research. However, third-generation sequencing data has a high leve...

Bibliographic Details
Main Authors: Liu, Jiaqi, Wang, Jiayin, Xiao, Xiao, Lai, Xin, Dai, Daocheng, Zhang, Xuanping, Zhu, Xiaoyan, Zhao, Zhongmeng, Wang, Juan, Li, Zhimin
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Methodology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7677778/
https://www.ncbi.nlm.nih.gov/pubmed/33208104
http://dx.doi.org/10.1186/s12864-020-07008-9
author Liu, Jiaqi
Wang, Jiayin
Xiao, Xiao
Lai, Xin
Dai, Daocheng
Zhang, Xuanping
Zhu, Xiaoyan
Zhao, Zhongmeng
Wang, Juan
Li, Zhimin
collection PubMed
description BACKGROUND: The emergence of third-generation sequencing technology, featuring longer read lengths, marks a great advance over next-generation sequencing technology and has greatly promoted biological research. However, third-generation sequencing data has a high sequencing error rate, which inevitably affects downstream analysis. Although sequencing error rates have been improving in recent years, large amounts of data were produced at high error rates, and discarding them would be a huge waste. Thus, error correction for third-generation sequencing data is especially important. The existing error correction methods perform poorly at heterozygous sites, which are ubiquitous in diploid and polyploid organisms. Therefore, there is a lack of error correction algorithms for heterozygous loci, especially at low coverage. RESULTS: In this article, we propose an error correction method named QIHC. QIHC is a hybrid correction method, which needs both next-generation and third-generation sequencing data. QIHC greatly enhances the sensitivity of distinguishing heterozygous sites from sequencing errors, which leads to high error correction accuracy. To achieve this, QIHC establishes a set of probabilistic models based on a Bayesian classifier to estimate the heterozygosity of a site, and makes a judgment by calculating the posterior probabilities. The proposed method consists of three modules, which respectively generate a pseudo reference sequence, obtain the read alignments, and estimate the heterozygosity of the sites and correct the reads harboring them. The last module is the core module of QIHC and is designed to handle the calculations for the multiple cases that arise at a heterozygous site. The other two modules enable the reads to be mapped to the pseudo reference sequence, which to some extent overcomes the inefficiency of the multiple mappings adopted by the existing error correction methods. CONCLUSIONS: To verify the performance of our method, we selected Canu and Jabba to compare with QIHC in several aspects. Because QIHC is a hybrid correction method, we first conducted a group of experiments under different coverages of the next-generation sequencing data. QIHC is far ahead of Jabba in accuracy. We then varied the coverage of the third-generation sequencing data and again compared the performance of Canu, Jabba and QIHC. QIHC outperforms the other two methods in accuracy of both correcting sequencing errors and identifying heterozygous sites, especially at low coverage. We also carried out a comparative analysis of Canu and QIHC at different error rates of the third-generation sequencing data. QIHC still performs better. Therefore, QIHC is superior to the existing error correction methods when heterozygous sites exist.
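The abstract says a Bayesian classifier estimates heterozygosity from posterior probabilities but gives no model details. The sketch below is only a generic illustration of that idea, not the QIHC model: the binomial read-count likelihood, the three genotype classes, the function name `genotype_posteriors`, and the default error rate and priors are all assumptions made for this example.

```python
from math import comb


def genotype_posteriors(alt_count, depth, error_rate=0.15,
                        prior_het=0.01, prior_hom_alt=0.01):
    """Posterior over three genotype classes at one site.

    Hypothetical illustration: reads showing the alternative base are
    modelled as Binomial(depth, p), where p depends on the genotype.
    """
    # Assumed fraction of reads carrying the alternative base per genotype.
    alt_fraction = {
        "hom_ref": error_rate,        # alternative bases are errors only
        "het": 0.5,                   # both alleles sampled equally
        "hom_alt": 1.0 - error_rate,  # reference bases are errors only
    }
    # Assumed priors; the remainder of the mass goes to hom_ref.
    priors = {
        "hom_ref": 1.0 - prior_het - prior_hom_alt,
        "het": prior_het,
        "hom_alt": prior_hom_alt,
    }
    # Joint probability = prior * binomial likelihood of the observed counts.
    joint = {}
    for genotype, p in alt_fraction.items():
        likelihood = (comb(depth, alt_count)
                      * p ** alt_count
                      * (1.0 - p) ** (depth - alt_count))
        joint[genotype] = priors[genotype] * likelihood
    # Normalise by the evidence to obtain posteriors (Bayes' rule).
    evidence = sum(joint.values())
    return {genotype: value / evidence for genotype, value in joint.items()}


if __name__ == "__main__":
    balanced = genotype_posteriors(alt_count=10, depth=20)
    sparse = genotype_posteriors(alt_count=2, depth=20)
    print(max(balanced, key=balanced.get))
    print(max(sparse, key=sparse.get))
```

Under these assumed parameters, a site where half of the reads carry the alternative base is assigned to the heterozygous class, while a site with only a few alternative bases is treated as homozygous with sequencing errors. A real corrector would need a likelihood fitted to the actual error profile of the long reads.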
format Online
Article
Text
id pubmed-7677778
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7677778 2020-11-20 A hybrid correcting method considering heterozygous variations by a comprehensive probabilistic model Liu, Jiaqi Wang, Jiayin Xiao, Xiao Lai, Xin Dai, Daocheng Zhang, Xuanping Zhu, Xiaoyan Zhao, Zhongmeng Wang, Juan Li, Zhimin BMC Genomics Methodology BioMed Central 2020-11-18 /pmc/articles/PMC7677778/ /pubmed/33208104 http://dx.doi.org/10.1186/s12864-020-07008-9 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A hybrid correcting method considering heterozygous variations by a comprehensive probabilistic model
topic Methodology