Large scale sequence alignment via efficient inference in generative models

Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guaran...

Bibliographic Details
Main authors:	Mongia, Mihir; Shen, Chengze; Davoodi, Arash Gholami; Marçais, Guillaume; Mohimani, Hosein
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10160065/ https://www.ncbi.nlm.nih.gov/pubmed/37142645 http://dx.doi.org/10.1038/s41598-023-34257-x

collection	PubMed
description	Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity, especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long reads from Pacific Biosciences sequencers to genome sequences.
format	Online Article Text
id	pubmed-10160065
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-10160065 2023-05-06 Sci Rep Article Nature Publishing Group UK 2023-05-04 /pmc/articles/PMC10160065/ /pubmed/37142645 http://dx.doi.org/10.1038/s41598-023-34257-x Text en
license	© The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Large scale sequence alignment via efficient inference in generative models
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10160065/ https://www.ncbi.nlm.nih.gov/pubmed/37142645 http://dx.doi.org/10.1038/s41598-023-34257-x
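
The brute-force search described in the abstract (score every query and reference pair by a log-likelihood ratio, then keep the maximum) can be sketched as follows. This is only an illustration under strong assumptions that are not taken from the article: reads are equal-length and already aligned position by position (no insertions or deletions), bases are uniform, and the joint model is a simple substitution model with a hypothetical `mismatch_rate`. The function names are likewise hypothetical.

```python
import math

ALPHABET = "ACGT"


def make_joint(mismatch_rate=0.1):
    """Hypothetical joint model over base pairs (an assumption, not the paper's model).

    A base is drawn uniformly, then copied with probability 1 - mismatch_rate
    or replaced by one of the three other bases uniformly otherwise.
    """
    joint = {}
    for a in ALPHABET:
        for b in ALPHABET:
            if a == b:
                joint[(a, b)] = 0.25 * (1.0 - mismatch_rate)
            else:
                joint[(a, b)] = 0.25 * mismatch_rate / 3.0
    return joint


def log_likelihood_ratio(query, ref, joint, marginal=0.25):
    """Sum over positions of log P(q, r) / (P(q) * P(r))."""
    if len(query) != len(ref):
        raise ValueError("this sketch assumes equal-length, gap-free reads")
    independent = marginal * marginal
    return sum(math.log(joint[(q, r)] / independent) for q, r in zip(query, ref))


def brute_force_best_match(query, references, joint):
    """Score the query against every reference; cost is linear in database size."""
    best_index, best_score = -1, float("-inf")
    for i, ref in enumerate(references):
        score = log_likelihood_ratio(query, ref, joint)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


if __name__ == "__main__":
    joint = make_joint(mismatch_rate=0.1)
    database = ["ACGTACGT", "TTTTCCCC", "ACGTACGA"]
    print(brute_force_best_match("ACGTACGA", database, joint))
```

The loop in `brute_force_best_match` is the part whose cost grows linearly with the database. The bucketing strategy mentioned in the abstract is meant to avoid scoring every pair; it is not shown here because the record does not describe how it works.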
