Cargando…

Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees

Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of l...

Descripción completa

Detalles Bibliográficos
Autores principales: Mahmud, Md Pavel, Wiedenhoeft, John, Schliep, Alexander
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2012
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436807/
https://www.ncbi.nlm.nih.gov/pubmed/22962448
http://dx.doi.org/10.1093/bioinformatics/bts380
_version_ 1782242702221377536
author Mahmud, Md Pavel
Wiedenhoeft, John
Schliep, Alexander
author_facet Mahmud, Md Pavel
Wiedenhoeft, John
Schliep, Alexander
author_sort Mahmud, Md Pavel
collection PubMed
description Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L(1) distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times, which match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L(1) distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurate than many popular read mappers over a wide range of structural variants. Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. Contact: pavelm@cs.rutgers.edu Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3436807
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-34368072012-12-12 Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees Mahmud, Md Pavel Wiedenhoeft, John Schliep, Alexander Bioinformatics Original Papers Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L(1) distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times, which match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L(1) distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurate than many popular read mappers over a wide range of structural variants. Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. Contact: pavelm@cs.rutgers.edu Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436807/ /pubmed/22962448 http://dx.doi.org/10.1093/bioinformatics/bts380 Text en © The Author(s) (2012). Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Mahmud, Md Pavel
Wiedenhoeft, John
Schliep, Alexander
Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title_full Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title_fullStr Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title_full_unstemmed Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title_short Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
title_sort indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436807/
https://www.ncbi.nlm.nih.gov/pubmed/22962448
http://dx.doi.org/10.1093/bioinformatics/bts380
work_keys_str_mv AT mahmudmdpavel indeltolerantreadmappingwithtrinucleotidefrequenciesusingcacheobliviouskdtrees
AT wiedenhoeftjohn indeltolerantreadmappingwithtrinucleotidefrequenciesusingcacheobliviouskdtrees
AT schliepalexander indeltolerantreadmappingwithtrinucleotidefrequenciesusingcacheobliviouskdtrees