
Allele detection using k-mer-based sequencing error profiles

MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high...


Bibliographic Details
Main authors: Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/
https://www.ncbi.nlm.nih.gov/pubmed/37928341
http://dx.doi.org/10.1093/bioadv/vbad149
_version_ 1785131139239772160
author Ashraf, Hufsah
Ebler, Jana
Marschall, Tobias
collection PubMed
description MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. RESULTS: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. AVAILABILITY AND IMPLEMENTATION: https://github.com/whatshap/whatshap.
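The idea summarized in the abstract (learn an error model in non-variant regions, then use it to tell sequencing errors apart from alternative alleles) can be illustrated with a toy Python sketch. This is not the k-merald or WhatsHap implementation: the function names, the pseudocount smoothing, the choice k = 3, and the simplification to gap-free, equal-length windows are all assumptions made for illustration, and the training data are invented.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def learn_error_profile(aligned_pairs, k):
    """Count how often each reference k-mer is observed as each read k-mer.

    aligned_pairs holds (reference, read) windows from non-variant regions.
    Assumption: windows are gap-free and of equal length.
    """
    counts = defaultdict(Counter)
    for ref, read in aligned_pairs:
        for ref_kmer, read_kmer in zip(kmers(ref, k), kmers(read, k)):
            counts[ref_kmer][read_kmer] += 1
    return counts


def kmer_cost(counts, ref_kmer, read_kmer, k, pseudocount=1.0):
    """Negative log of the smoothed probability of reading ref_kmer as read_kmer."""
    observed = counts.get(ref_kmer, Counter())
    total = sum(observed.values()) + pseudocount * 4 ** k
    return -math.log((observed[read_kmer] + pseudocount) / total)


def allele_cost(counts, allele_seq, read_seq, k):
    """Sum the k-mer costs of explaining read_seq by allele_seq."""
    pairs = zip(kmers(allele_seq, k), kmers(read_seq, k))
    return sum(kmer_cost(counts, a, r, k) for a, r in pairs)


def detect_allele(counts, alleles, read_seq, k):
    """Return the index of the allele with the lowest total cost."""
    costs = [allele_cost(counts, a, read_seq, k) for a in alleles]
    return min(range(len(alleles)), key=costs.__getitem__)


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# Invented training windows: ACGTA is misread as ACCTA in 2 of 10 cases
# (a systematic error), while ACATA is always read correctly.
training = (
    [("ACGTA", "ACGTA")] * 8
    + [("ACGTA", "ACCTA")] * 2
    + [("ACATA", "ACATA")] * 10
)
profile = learn_error_profile(training, k=3)

# Hypothetical variant site with a reference and an alternative allele.
alleles = ["ACGTA", "ACATA"]
read = "ACCTA"

print([hamming(a, read) for a in alleles])          # mismatch count per allele
print(detect_allele(profile, alleles, read, k=3))   # index of the chosen allele
```

In this example the read differs from both alleles by one mismatch, so a plain mismatch count cannot separate them. The learned profile assigns a lower cost to the reference allele because the observed substitution matches an error pattern seen during training.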
format Online
Article
Text
id pubmed-10625474
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10625474 2023-11-05 Allele detection using k-mer-based sequencing error profiles Ashraf, Hufsah Ebler, Jana Marschall, Tobias Bioinform Adv Original Paper Oxford University Press 2023-10-20 /pmc/articles/PMC10625474/ /pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Allele detection using k-mer-based sequencing error profiles
topic Original Paper