
Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time and are therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is calculating the identity scores (percentage of identical nucleotides in...
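To make the quantity in the abstract concrete, here is a minimal sketch of the quadratic-time baseline that Identity is designed to avoid: a Needleman-Wunsch global alignment followed by a count of identical columns. It is not the Identity tool's method. The function name `global_identity` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. The score is computed as identical columns divided by all alignment columns, gap columns included, and it can depend on which of several equally optimal alignments the traceback happens to follow.

```python
def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identity score (0-100) from one optimal global alignment.

    Illustrative only: the scoring parameters are assumed values,
    not those of any particular tool.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 100.0

    # score[i][j] = best score for aligning a[:i] with b[:j].
    # Filling this table costs O(n * m) time and space.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns
    # and total columns (gap columns count toward the total).
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * matches / columns
```

For example, `global_identity("ACGT", "AGT")` aligns three identical nucleotides plus one gap column and returns 75.0. The table fill is what makes this approach quadratic, which is the cost the abstract refers to.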


Bibliographic Details
Main authors:	Girgis, Hani Z; James, Benjamin T; Luczak, Brian B
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Methods Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850047/ https://www.ncbi.nlm.nih.gov/pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001

author	Girgis, Hani Z James, Benjamin T Luczak, Brian B
collection	PubMed
description	Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms run in quadratic time and are therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is calculating the identity scores (percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being 2–80 times faster than BLAST, Mash, MUMmer4 and USEARCH. Identity was the best-performing tool when searching for low-identity matches. In constructing phylogenetic trees from about 6000 transcripts, the tree built from the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
format	Online Article Text
id	pubmed-7850047
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7850047 2021-02-04 NAR Genom Bioinform Methods Article Oxford University Press 2021-02-01 /pmc/articles/PMC7850047/ /pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models
topic	Methods Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7850047/ https://www.ncbi.nlm.nih.gov/pubmed/33554117 http://dx.doi.org/10.1093/nargab/lqab001
