Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models
Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score of each pair (the percentage of identical nucleotides in...
Main authors: | Girgis, Hani Z; James, Benjamin T; Luczak, Brian B |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Methods Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850047/ https://www.ncbi.nlm.nih.gov/pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001 |
_version_ | 1783645398801317888 |
---|---|
author | Girgis, Hani Z; James, Benjamin T; Luczak, Brian B |
author_sort | Girgis, Hani Z |
collection | PubMed |
description | Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score of each pair (the percentage of identical nucleotides in an optimal alignment, gaps included, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, pairwise identity scores can be predicted in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2–80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (compared with the trees built from andi, FSWM and Mash). Identity can produce pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity. |
format | Online Article Text |
id | pubmed-7850047 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7850047 2021-02-04 NAR Genom Bioinform Methods Article Oxford University Press 2021-02-01 /pmc/articles/PMC7850047/ /pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models |
topic | Methods Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850047/ https://www.ncbi.nlm.nih.gov/pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001 |
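
The description defines the identity score as the percentage of identical nucleotides in an optimal global alignment of two sequences, gaps included. A minimal sketch of that definition follows, using a plain Needleman-Wunsch dynamic program. This is the quadratic-time baseline that the record says Identity is designed to avoid, not Identity's own alignment-free method. The function name `identity_score` and the scoring parameters (match 1, mismatch -1, gap -1) are illustrative assumptions; other parameters can yield a different optimal alignment and therefore a different score.

```python
def identity_score(a: str, b: str, match: int = 1,
                   mismatch: int = -1, gap: int = -1) -> float:
    """Percentage of identical columns in one optimal global alignment.

    Scoring parameters are assumptions chosen for illustration only.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        raise ValueError("at least one sequence must be non-empty")

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: O(n * m) time and space.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting alignment columns
    # (gap columns included) and columns with identical nucleotides.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns


print(identity_score("ACGT", "ACGA"))  # 3 of 4 columns identical: 75.0
```

The table has one cell per pair of positions, so both time and memory grow with the product of the two sequence lengths. That cost is why this approach does not scale to genomes that are millions of nucleotides long.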