Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data

BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alig...

Bibliographic Details
Main authors: Cline, Eliot, Wisittipanit, Nuttachat, Boongoen, Tossapon, Chukeatirote, Ekachai, Struss, Darush, Eungwanichayapant, Anant
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/
https://www.ncbi.nlm.nih.gov/pubmed/33354434
http://dx.doi.org/10.7717/peerj.10501
_version_ 1783621081283690496
author Cline, Eliot
Wisittipanit, Nuttachat
Boongoen, Tossapon
Chukeatirote, Ekachai
Struss, Darush
Eungwanichayapant, Anant
author_facet Cline, Eliot
Wisittipanit, Nuttachat
Boongoen, Tossapon
Chukeatirote, Ekachai
Struss, Darush
Eungwanichayapant, Anant
author_sort Cline, Eliot
collection PubMed
description BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability that a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. MATERIALS AND METHODS: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results were compared. RESULTS: Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best-performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95, and was largely unchanged before and after mapping score recalibration. CONCLUSION: Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
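The abstract states that MAPQ scores were re-computed from per-read estimates of the probability of misalignment. As a minimal sketch of that step, assuming the standard Phred-style definition from the SAM format (MAPQ = -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer), the conversion could look like the following. The function name `mapq_from_error_prob` and the cap of 60 are illustrative assumptions, not details taken from the article.

```python
import math

# Hypothetical cap on the output score; the article's abstract does not state one.
MAX_MAPQ = 60


def mapq_from_error_prob(p_wrong: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert an estimated misalignment probability to a Phred-scaled MAPQ.

    Uses MAPQ = -10 * log10(p_wrong), rounded to the nearest integer and
    clipped to [0, max_mapq]. A probability of 0 maps to the cap because
    the logarithm is undefined there.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong == 0.0:
        return max_mapq
    score = int(round(-10.0 * math.log10(p_wrong)))
    return max(0, min(max_mapq, score))


if __name__ == "__main__":
    # Example: model outputs for a handful of reads (made-up values).
    for p in (1.0, 0.5, 0.1, 0.001, 1e-9):
        print(p, mapq_from_error_prob(p))
```

Under this definition, a read with an estimated 10% chance of misalignment receives MAPQ 10, and one with a 0.1% chance receives MAPQ 30; very small probabilities saturate at the assumed cap.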
format Online
Article
Text
id pubmed-7727374
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7727374 2020-12-21 Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data Cline, Eliot Wisittipanit, Nuttachat Boongoen, Tossapon Chukeatirote, Ekachai Struss, Darush Eungwanichayapant, Anant PeerJ Bioinformatics PeerJ Inc. 2020-12-07 /pmc/articles/PMC7727374/ /pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501 Text en © 2020 Cline et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle Bioinformatics
Cline, Eliot
Wisittipanit, Nuttachat
Boongoen, Tossapon
Chukeatirote, Ekachai
Struss, Darush
Eungwanichayapant, Anant
Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title_full Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title_fullStr Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title_full_unstemmed Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title_short Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
title_sort recalibration of mapping quality scores in illumina short-read alignments improves snp detection results in low-coverage sequencing data
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/
https://www.ncbi.nlm.nih.gov/pubmed/33354434
http://dx.doi.org/10.7717/peerj.10501
work_keys_str_mv AT clineeliot recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata
AT wisittipanitnuttachat recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata
AT boongoentossapon recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata
AT chukeatiroteekachai recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata
AT strussdarush recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata
AT eungwanichayapantanant recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata