Accurate estimation of short read mapping quality for next-generation genome sequencing

Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment—in principle, this quality score tells researchers the likelihood that the alignment is correct. How...

Bibliographic Details
Main authors: Ruffalo, Matthew; Koyutürk, Mehmet; Ray, Soumya; LaFramboise, Thomas
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436835/
https://www.ncbi.nlm.nih.gov/pubmed/22962451
http://dx.doi.org/10.1093/bioinformatics/bts408
author Ruffalo, Matthew
Koyutürk, Mehmet
Ray, Soumya
LaFramboise, Thomas
collection PubMed
description Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy and the qualities of many mappings are underestimated, encouraging the researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants. Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read Mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches and deletions, mapping quality score returned by the alignment tool, if available, and number of mappings) as features for classification and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality. Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can ‘resurrect’ many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms. Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/. Contact: matthew.ruffalo@case.edu.
format Online
Article
Text
id pubmed-3436835
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3436835 2012-12-12 Accurate estimation of short read mapping quality for next-generation genome sequencing Ruffalo, Matthew Koyutürk, Mehmet Ray, Soumya LaFramboise, Thomas Bioinformatics Original Papers Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy and the qualities of many mappings are underestimated, encouraging the researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants. Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read Mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches and deletions, mapping quality score returned by the alignment tool, if available, and number of mappings) as features for classification and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality. Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can ‘resurrect’ many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.
Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/. Contact: matthew.ruffalo@case.edu. Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436835/ /pubmed/22962451 http://dx.doi.org/10.1093/bioinformatics/bts408 Text en © The Author(s) (2012). Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Accurate estimation of short read mapping quality for next-generation genome sequencing
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436835/
https://www.ncbi.nlm.nih.gov/pubmed/22962451
http://dx.doi.org/10.1093/bioinformatics/bts408