
Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data

BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alig...


Bibliographic Details
Main Authors:	Cline, Eliot; Wisittipanit, Nuttachat; Boongoen, Tossapon; Chukeatirote, Ekachai; Struss, Darush; Eungwanichayapant, Anant
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2020
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/ https://www.ncbi.nlm.nih.gov/pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501

author	Cline, Eliot; Wisittipanit, Nuttachat; Boongoen, Tossapon; Chukeatirote, Ekachai; Struss, Darush; Eungwanichayapant, Anant
collection	PubMed
description	BACKGROUND: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability that the read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. MATERIALS AND METHODS: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and to output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates, and the SAM files were updated with the new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results were compared. RESULTS: Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best-performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95 and was largely unchanged by mapping score recalibration. CONCLUSION: Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously. They exploit every sample’s reads to compensate for the low read-depth of individual samples. This improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
format	Online Article Text
id	pubmed-7727374
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7727374 2020-12-21 PeerJ Bioinformatics PeerJ Inc. 2020-12-07 /pmc/articles/PMC7727374/ /pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501 Text en © 2020 Cline et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727374/ https://www.ncbi.nlm.nih.gov/pubmed/33354434 http://dx.doi.org/10.7717/peerj.10501
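The abstract above describes re-computing MAPQ scores from estimated misalignment probabilities and writing them back into the SAM files. The SAM specification defines MAPQ as −10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The following is a minimal sketch of that conversion, not the authors' implementation: the function names `prob_to_mapq` and `recalibrate_sam_line` are hypothetical, and the cap of 60 is an assumption, since the maximum MAPQ depends on the aligner.

```python
import math

# Assumed ceiling for illustration only; the real maximum varies by aligner.
MAX_MAPQ = 60


def prob_to_mapq(p_misaligned: float, max_mapq: int = MAX_MAPQ) -> int:
    """Convert a misalignment probability to a Phred-scaled MAPQ score."""
    if not 0.0 <= p_misaligned <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_misaligned == 0.0:
        return max_mapq  # avoid log10(0); treat as maximally confident
    mapq = round(-10.0 * math.log10(p_misaligned))
    return max(0, min(max_mapq, mapq))


def recalibrate_sam_line(line: str, p_misaligned: float) -> str:
    """Return a SAM alignment line with its MAPQ field (column 5) replaced."""
    fields = line.rstrip("\n").split("\t")
    fields[4] = str(prob_to_mapq(p_misaligned))
    return "\t".join(fields)
```

Under these assumptions, a read with an estimated misalignment probability of 0.001 receives a MAPQ of 30, and a read that is as likely wrong as right (probability 0.5) receives a MAPQ of 3.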


Similar Items