
Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity scores (percentage of identical nucleotides in...


Bibliographic Details
Main authors: Girgis, Hani Z, James, Benjamin T, Luczak, Brian B
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850047/
https://www.ncbi.nlm.nih.gov/pubmed/33554117
http://dx.doi.org/10.1093/nargab/lqab001
_version_ 1783645398801317888
author Girgis, Hani Z
James, Benjamin T
Luczak, Brian B
collection PubMed
description Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time, which makes them slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity scores (the percentage of identical nucleotides in an optimal alignment, gaps included, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2–80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. When phylogenetic trees were constructed from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to the trees built from andi, FSWM and Mash). Identity can produce pairwise identity scores for bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
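For reference, the quantity that the abstract says Identity predicts can be computed exactly, at quadratic cost, with a textbook global alignment. The sketch below is not the Identity tool or its method; it is a plain Needleman-Wunsch alignment with a traceback that counts matching columns. The function name `global_identity` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, and a different scoring scheme can yield a different optimal alignment and therefore a different identity score.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identity score of one optimal global alignment of a and b.

    Identity = matching columns / total alignment columns (gaps included).
    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 1.0  # convention chosen here for two empty sequences

    # score[i][j] = best score for aligning a[:i] with b[:j].
    # Filling this table costs O(n * m) time and space: the quadratic
    # behaviour that makes optimal alignment slow on long sequences.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting columns and matches.
    i, j = n, m
    matches = 0
    columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += int(a[i - 1] == b[j - 1])  # aligned pair
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return matches / columns


if __name__ == "__main__":
    print(global_identity("ACGT", "ACGA"))
    print(global_identity("ACGT", "AGT"))
```

Because the table has (n + 1) × (m + 1) cells, this exact computation does not scale to genome-length inputs, which is the gap the abstract says an alignment-free, linear-time predictor is meant to fill.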
format Online
Article
Text
id pubmed-7850047
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7850047 2021-02-04 Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models Girgis, Hani Z James, Benjamin T Luczak, Brian B NAR Genom Bioinform Methods Article Oxford University Press 2021-02-01 /pmc/articles/PMC7850047/ /pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models
topic Methods Article