DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors

Delins, also known as a complex indel, is a combined genomic structural variation formed by deleting and inserting DNA fragments at a common genomic location. Recent studies have emphasized the importance of delins in cancer diagnosis and treatment. Although the long reads from PacBio CLR sequencing signific...

Bibliographic Details
Main authors: Wang, Shenjie, Zhang, Xuanping, Qiang, Geng, Wang, Jiayin
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9858578/
https://www.ncbi.nlm.nih.gov/pubmed/36672745
http://dx.doi.org/10.3390/genes14010004
_version_ 1784874136749735936
author Wang, Shenjie
Zhang, Xuanping
Qiang, Geng
Wang, Jiayin
author_facet Wang, Shenjie
Zhang, Xuanping
Qiang, Geng
Wang, Jiayin
author_sort Wang, Shenjie
collection PubMed
description Delins, also known as a complex indel, is a combined genomic structural variation formed by deleting and inserting DNA fragments at a common genomic location. Recent studies have emphasized the importance of delins in cancer diagnosis and treatment. Although the long reads from PacBio CLR sequencing significantly facilitate delins calling, the existing approaches still encounter computational challenges from the high level of sequencing errors, and often introduce errors in genotyping and phasing delins. In this paper, we propose an efficient algorithmic pipeline, named delInsCaller, to identify delins at haplotype resolution from PacBio CLR sequencing data. delInsCaller uses a fault-tolerant method that calculates a variation density score, which helps to locate the candidate mutational regions under a high level of sequencing errors. It adopts a base association-based contig splicing method, which facilitates contig splicing in the presence of false-positive interference. We conducted a series of experiments on simulated datasets, and the results showed that delInsCaller outperformed several state-of-the-art approaches, e.g., SVseq3, across a wide range of parameter settings, such as read depth and sequencing error rate. delInsCaller often obtained higher F-measures than other approaches; in particular, it maintained its advantage at ~15% sequencing errors. delInsCaller also significantly improved the N50 values with almost no loss of haplotype accuracy compared with the existing approach.
format Online
Article
Text
id pubmed-9858578
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9858578 2023-01-21 DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors Wang, Shenjie Zhang, Xuanping Qiang, Geng Wang, Jiayin Genes (Basel) Article Delins, also known as a complex indel, is a combined genomic structural variation formed by deleting and inserting DNA fragments at a common genomic location. Recent studies have emphasized the importance of delins in cancer diagnosis and treatment. Although the long reads from PacBio CLR sequencing significantly facilitate delins calling, the existing approaches still encounter computational challenges from the high level of sequencing errors, and often introduce errors in genotyping and phasing delins. In this paper, we propose an efficient algorithmic pipeline, named delInsCaller, to identify delins at haplotype resolution from PacBio CLR sequencing data. delInsCaller uses a fault-tolerant method that calculates a variation density score, which helps to locate the candidate mutational regions under a high level of sequencing errors. It adopts a base association-based contig splicing method, which facilitates contig splicing in the presence of false-positive interference. We conducted a series of experiments on simulated datasets, and the results showed that delInsCaller outperformed several state-of-the-art approaches, e.g., SVseq3, across a wide range of parameter settings, such as read depth and sequencing error rate. delInsCaller often obtained higher F-measures than other approaches; in particular, it maintained its advantage at ~15% sequencing errors. delInsCaller also significantly improved the N50 values with almost no loss of haplotype accuracy compared with the existing approach. MDPI 2022-12-20 /pmc/articles/PMC9858578/ /pubmed/36672745 http://dx.doi.org/10.3390/genes14010004 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Shenjie
Zhang, Xuanping
Qiang, Geng
Wang, Jiayin
DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title_full DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title_fullStr DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title_full_unstemmed DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title_short DelInsCaller: An Efficient Algorithm for Identifying Delins and Estimating Haplotypes from Long Reads with High Level of Sequencing Errors
title_sort delinscaller: an efficient algorithm for identifying delins and estimating haplotypes from long reads with high level of sequencing errors
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9858578/
https://www.ncbi.nlm.nih.gov/pubmed/36672745
http://dx.doi.org/10.3390/genes14010004
work_keys_str_mv AT wangshenjie delinscalleranefficientalgorithmforidentifyingdelinsandestimatinghaplotypesfromlongreadswithhighlevelofsequencingerrors
AT zhangxuanping delinscalleranefficientalgorithmforidentifyingdelinsandestimatinghaplotypesfromlongreadswithhighlevelofsequencingerrors
AT qianggeng delinscalleranefficientalgorithmforidentifyingdelinsandestimatinghaplotypesfromlongreadswithhighlevelofsequencingerrors
AT wangjiayin delinscalleranefficientalgorithmforidentifyingdelinsandestimatinghaplotypesfromlongreadswithhighlevelofsequencingerrors
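
The abstract in this record reports results as F-measures and N50 values. As a reader's aid, the sketch below shows the standard textbook definitions of both metrics. It is not taken from the delInsCaller paper: the function names `f_measure` and `n50` are hypothetical, and the record does not state how the authors count true positives or which blocks (contigs or haplotype blocks) their N50 is computed over, so those choices are assumptions here.

```python
def f_measure(tp, fp, fn):
    """Standard F1 score: harmonic mean of precision and recall.

    tp, fp, fn are counts of true-positive, false-positive and
    false-negative calls (how a call is matched is an assumption
    left to the caller).
    """
    if tp == 0:
        return 0.0  # no correct calls: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def n50(lengths):
    """Standard N50: the largest length L such that blocks of
    length >= L together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(f_measure(tp=8, fp=2, fn=2))
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 indicates longer contiguous blocks, which is why the abstract pairs it with haplotype accuracy: longer blocks are only useful if they are not obtained by joining fragments incorrectly.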