Cargando…
Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments
Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter p...
Autores principales: | , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Elsevier
2020
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660437/ https://www.ncbi.nlm.nih.gov/pubmed/33205128 http://dx.doi.org/10.1016/j.patter.2020.100081 |
_version_ | 1783609003107942400 |
---|---|
author | Baharav, Tavor Z. Kamath, Govinda M. Tse, David N. Shomorony, Ilan |
author_facet | Baharav, Tavor Z. Kamath, Govinda M. Tse, David N. Shomorony, Ilan |
author_sort | Baharav, Tavor Z. |
collection | PubMed |
description | Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. |
format | Online Article Text |
id | pubmed-7660437 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-76604372020-11-16 Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments Baharav, Tavor Z. Kamath, Govinda M. Tse, David N. Shomorony, Ilan Patterns (N Y) Article Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. Elsevier 2020-07-31 /pmc/articles/PMC7660437/ /pubmed/33205128 http://dx.doi.org/10.1016/j.patter.2020.100081 Text en © 2020 The Authors http://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Article Baharav, Tavor Z. Kamath, Govinda M. Tse, David N. Shomorony, Ilan Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title | Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title_full | Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title_fullStr | Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title_full_unstemmed | Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title_short | Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments |
title_sort | spectral jaccard similarity: a new approach to estimating pairwise sequence alignments |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660437/ https://www.ncbi.nlm.nih.gov/pubmed/33205128 http://dx.doi.org/10.1016/j.patter.2020.100081 |
work_keys_str_mv | AT baharavtavorz spectraljaccardsimilarityanewapproachtoestimatingpairwisesequencealignments AT kamathgovindam spectraljaccardsimilarityanewapproachtoestimatingpairwisesequencealignments AT tsedavidn spectraljaccardsimilarityanewapproachtoestimatingpairwisesequencealignments AT shomoronyilan spectraljaccardsimilarityanewapproachtoestimatingpairwisesequencealignments |