Extended many-item similarity indices for sets of nucleotide and protein sequences

Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substituti...

Bibliographic Details
Main authors: Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8253954/
https://www.ncbi.nlm.nih.gov/pubmed/34257841
http://dx.doi.org/10.1016/j.csbj.2021.06.021
author Bajusz, Dávid
Miranda-Quintana, Ramón Alain
Rácz, Anita
Héberger, Károly
collection PubMed
description Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
format Online
Article
Text
id pubmed-8253954
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-8253954 2021-07-12 Comput Struct Biotechnol J Research Article Research Network of Computational and Structural Biotechnology 2021-06-16 /pmc/articles/PMC8253954/ /pubmed/34257841 http://dx.doi.org/10.1016/j.csbj.2021.06.021 Text en © 2021 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Extended many-item similarity indices for sets of nucleotide and protein sequences
topic Research Article