Adaptable probabilistic mapping of short reads using position specific scoring matrices

BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some da...

Bibliographic Details
Main authors: Kerpedjiev, Peter; Frellsen, Jes; Lindgreen, Stinus; Krogh, Anders
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021105/
https://www.ncbi.nlm.nih.gov/pubmed/24717095
http://dx.doi.org/10.1186/1471-2105-15-100
author Kerpedjiev, Peter
Frellsen, Jes
Lindgreen, Stinus
Krogh, Anders
author_sort Kerpedjiev, Peter
collection PubMed
description BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can, for some data types, result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases. RESULTS: We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT-rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low-quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability-based alignment method on real and simulated data sets.
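The idea summarized in the abstract, scoring a read against the genome with a position-specific scoring matrix (PSSM) built from its quality scores, can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the abstract does not detail. It assumes Phred-scaled qualities, a uniform 0.25 background, errors split evenly over the three other bases, a uniform prior over start positions, and no indels, evolution model, or reverse strand. The function names (`read_to_pssm`, `score_window`, `best_hit`, `posteriors`) are hypothetical.

```python
import math

BASES = "ACGT"


def read_to_pssm(read, quals, background=0.25):
    """Build a log-odds PSSM with one column per read position.

    Assumption: Phred quality q gives error probability 10**(-q/10),
    capped at 0.75 so the called base is never less likely than chance.
    """
    pssm = []
    for base, q in zip(read, quals):
        err = min(10 ** (-q / 10), 0.75)
        column = {}
        for b in BASES:
            # Called base keeps 1 - err; the error mass is split evenly.
            p = 1 - err if b == base else err / 3
            column[b] = math.log2(p / background)
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores for one genome window."""
    return sum(col[b] for col, b in zip(pssm, window))


def best_hit(pssm, reference):
    """Return (score, start) of the highest-scoring ungapped placement."""
    n = len(pssm)
    hits = [(score_window(pssm, reference[i:i + n]), i)
            for i in range(len(reference) - n + 1)]
    return max(hits)


def posteriors(pssm, reference):
    """Posterior probability of each start position under a uniform prior."""
    n = len(pssm)
    weights = [2 ** score_window(pssm, reference[i:i + n])
               for i in range(len(reference) - n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    pssm = read_to_pssm("ACGT", [30, 30, 30, 30])
    print(best_hit(pssm, "TTACGTTT"))
    print(posteriors(pssm, "TTACGTTT"))
```

Because low-quality positions contribute scores close to zero whether or not they match, a mismatch at an unreliable base is penalized less than one at a confident base, which is something a plain mismatch count cannot express. The posterior over placements plays the role of a mapping quality in this toy setting.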
format Online
Article
Text
id pubmed-4021105
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4021105 2014-05-28 Adaptable probabilistic mapping of short reads using position specific scoring matrices Kerpedjiev, Peter; Frellsen, Jes; Lindgreen, Stinus; Krogh, Anders BMC Bioinformatics Research Article
BioMed Central 2014-04-09 /pmc/articles/PMC4021105/ /pubmed/24717095 http://dx.doi.org/10.1186/1471-2105-15-100 Text en Copyright © 2014 Kerpedjiev et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Adaptable probabilistic mapping of short reads using position specific scoring matrices
topic Research Article