Cargando…

Calibrating Seed-Based Heuristics to Map Short Reads With Sesame

The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matche...

Descripción completa

Detalles Bibliográficos
Autores principales:	Filion, Guillaume J., Cortini, Ruggero, Zorita, Eduard
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Frontiers Media S.A. 2020
Materias:	Genetics
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7331467/ https://www.ncbi.nlm.nih.gov/pubmed/32670351 http://dx.doi.org/10.3389/fgene.2020.00572

_version_	1783553333272772608
author	Filion, Guillaume J. Cortini, Ruggero Zorita, Eduard
author_facet	Filion, Guillaume J. Cortini, Ruggero Zorita, Eduard
author_sort	Filion, Guillaume J.
collection	PubMed
description	The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.
format	Online Article Text
id	pubmed-7331467
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-73314672020-07-14 Calibrating Seed-Based Heuristics to Map Short Reads With Sesame Filion, Guillaume J. Cortini, Ruggero Zorita, Eduard Front Genet Genetics The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame. Frontiers Media S.A. 2020-06-25 /pmc/articles/PMC7331467/ /pubmed/32670351 http://dx.doi.org/10.3389/fgene.2020.00572 Text en Copyright © 2020 Filion, Cortini and Zorita. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Genetics Filion, Guillaume J. Cortini, Ruggero Zorita, Eduard Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title	Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title_full	Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title_fullStr	Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title_full_unstemmed	Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title_short	Calibrating Seed-Based Heuristics to Map Short Reads With Sesame
title_sort	calibrating seed-based heuristics to map short reads with sesame
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7331467/ https://www.ncbi.nlm.nih.gov/pubmed/32670351 http://dx.doi.org/10.3389/fgene.2020.00572
work_keys_str_mv	AT filionguillaumej calibratingseedbasedheuristicstomapshortreadswithsesame AT cortiniruggero calibratingseedbasedheuristicstomapshortreadswithsesame AT zoritaeduard calibratingseedbasedheuristicstomapshortreadswithsesame

Calibrating Seed-Based Heuristics to Map Short Reads With Sesame

Ejemplares similares