Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alig...
Main authors: | Cline, Eliot; Wisittipanit, Nuttachat; Boongoen, Tossapon; Chukeatirote, Ekachai; Struss, Darush; Eungwanichayapant, Anant |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2020 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/ https://www.ncbi.nlm.nih.gov/pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501 |
_version_ | 1783621081283690496 |
---|---|
author | Cline, Eliot Wisittipanit, Nuttachat Boongoen, Tossapon Chukeatirote, Ekachai Struss, Darush Eungwanichayapant, Anant |
author_facet | Cline, Eliot Wisittipanit, Nuttachat Boongoen, Tossapon Chukeatirote, Ekachai Struss, Darush Eungwanichayapant, Anant |
author_sort | Cline, Eliot |
collection | PubMed |
description | BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. MATERIALS AND METHODS: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results compared. RESULTS: Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best-performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95, and was largely unchanged both before and after mapping score recalibration. CONCLUSION: Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing. |
format | Online Article Text |
id | pubmed-7727374 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-7727374 2020-12-21 Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data Cline, Eliot Wisittipanit, Nuttachat Boongoen, Tossapon Chukeatirote, Ekachai Struss, Darush Eungwanichayapant, Anant PeerJ Bioinformatics BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. MATERIALS AND METHODS: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results compared. RESULTS: Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best-performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95, and was largely unchanged both before and after mapping score recalibration. CONCLUSION: Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing. PeerJ Inc. 2020-12-07 /pmc/articles/PMC7727374/ /pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501 Text en © 2020 Cline et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Cline, Eliot Wisittipanit, Nuttachat Boongoen, Tossapon Chukeatirote, Ekachai Struss, Darush Eungwanichayapant, Anant Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title | Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title_full | Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title_fullStr | Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title_full_unstemmed | Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title_short | Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data |
title_sort | recalibration of mapping quality scores in illumina short-read alignments improves snp detection results in low-coverage sequencing data |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/ https://www.ncbi.nlm.nih.gov/pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501 |
work_keys_str_mv | AT clineeliot recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata AT wisittipanitnuttachat recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata AT boongoentossapon recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata AT chukeatiroteekachai recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata AT strussdarush recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata AT eungwanichayapantanant recalibrationofmappingqualityscoresinilluminashortreadalignmentsimprovessnpdetectionresultsinlowcoveragesequencingdata |
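
The abstract above says that MAPQ scores were re-computed from per-read estimates of the probability of misalignment. The record does not give the authors' exact procedure, so the following is only a minimal sketch of the standard Phred-scaled relationship used for MAPQ in the SAM format, MAPQ = -10 * log10(P(misaligned)). The function names `prob_to_mapq` and `mapq_to_prob` and the cap of 60 are illustrative assumptions, not taken from the article.

```python
import math

# Assumed upper bound for illustration only; many aligners cap MAPQ at some
# value, and SAM reserves 255 to mean "mapping quality not available".
MAX_MAPQ = 60


def prob_to_mapq(p_misaligned: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert a misalignment probability to a Phred-scaled MAPQ score.

    Hypothetical helper: rounds to the nearest integer and caps at max_mapq.
    """
    if not 0.0 <= p_misaligned <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_misaligned == 0.0:
        # log10(0) is undefined; treat a zero estimate as maximal confidence.
        return max_mapq
    mapq = -10.0 * math.log10(p_misaligned)
    return min(max_mapq, int(round(mapq)))


def mapq_to_prob(mapq: int) -> float:
    """Inverse mapping: the misalignment probability implied by a MAPQ score."""
    return 10.0 ** (-mapq / 10.0)


if __name__ == "__main__":
    # Example probabilities such as a classifier might output (made-up values).
    for p in (1.0, 0.5, 0.1, 0.001, 1e-9):
        print(f"P(misaligned)={p:g} -> MAPQ {prob_to_mapq(p)}")
```

Under this relationship, a read with an estimated 1-in-1000 chance of being misaligned would receive MAPQ 30, and very small probabilities are clipped to the assumed cap.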