Allele detection using k-mer-based sequencing error profiles

MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high...

Bibliographic Details
Main Authors:	Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/ https://www.ncbi.nlm.nih.gov/pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149

_version_	1785131139239772160
author	Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias
author_facet	Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias
author_sort	Ashraf, Hufsah
collection	PubMed
description	MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. RESULTS: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. AVAILABILITY AND IMPLEMENTATION: https://github.com/whatshap/whatshap.
format	Online Article Text
id	pubmed-10625474
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10625474 2023-11-05 Allele detection using k-mer-based sequencing error profiles Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias Bioinform Adv Original Paper Oxford University Press 2023-10-20 /pmc/articles/PMC10625474/ /pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Allele detection using k-mer-based sequencing error profiles
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/ https://www.ncbi.nlm.nih.gov/pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149
