Allele detection using k-mer-based sequencing error profiles
MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high...
Main authors: | Ashraf, Hufsah; Ebler, Jana; Marschall, Tobias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/ https://www.ncbi.nlm.nih.gov/pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149 |
_version_ | 1785131139239772160 |
---|---|
author | Ashraf, Hufsah Ebler, Jana Marschall, Tobias |
author_facet | Ashraf, Hufsah Ebler, Jana Marschall, Tobias |
author_sort | Ashraf, Hufsah |
collection | PubMed |
description | MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. RESULTS: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. AVAILABILITY AND IMPLEMENTATION: https://github.com/whatshap/whatshap. |
format | Online Article Text |
id | pubmed-10625474 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10625474 2023-11-05 Allele detection using k-mer-based sequencing error profiles Ashraf, Hufsah Ebler, Jana Marschall, Tobias Bioinform Adv Original Paper MOTIVATION: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. RESULTS: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. AVAILABILITY AND IMPLEMENTATION: https://github.com/whatshap/whatshap. Oxford University Press 2023-10-20 /pmc/articles/PMC10625474/ /pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Ashraf, Hufsah Ebler, Jana Marschall, Tobias Allele detection using k-mer-based sequencing error profiles |
title | Allele detection using k-mer-based sequencing error profiles |
title_full | Allele detection using k-mer-based sequencing error profiles |
title_fullStr | Allele detection using k-mer-based sequencing error profiles |
title_full_unstemmed | Allele detection using k-mer-based sequencing error profiles |
title_short | Allele detection using k-mer-based sequencing error profiles |
title_sort | allele detection using k-mer-based sequencing error profiles |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10625474/ https://www.ncbi.nlm.nih.gov/pubmed/37928341 http://dx.doi.org/10.1093/bioadv/vbad149 |
work_keys_str_mv | AT ashrafhufsah alleledetectionusingkmerbasedsequencingerrorprofiles AT eblerjana alleledetectionusingkmerbasedsequencingerrorprofiles AT marschalltobias alleledetectionusingkmerbasedsequencingerrorprofiles |