Cargando…

Allele detection using k-mer-based sequencing error profiles

MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high...

Descripción completa

Detalles Bibliográficos
Autores principales: Ashraf, Hufsah, Ebler, Jana, Marschall, Tobias
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/
https://www.ncbi.nlm.nih.gov/pubmed/37928341
http://dx.doi.org/10.1093/bioadv/vbad149
Descripción
Sumario:MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. RESULTS: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3 [Formula: see text] coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. AVAILABILITY AND IMPLEMENTATION: https://github.com/whatshap/whatshap.