Extended many-item similarity indices for sets of nucleotide and protein sequences
Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substituti...
Main authors: | Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Research Network of Computational and Structural Biotechnology 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8253954/ https://www.ncbi.nlm.nih.gov/pubmed/34257841 http://dx.doi.org/10.1016/j.csbj.2021.06.021 |
_version_ | 1783717626779795456 |
---|---|
author | Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly |
author_facet | Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly |
author_sort | Bajusz, Dávid |
collection | PubMed |
description | Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons. |
format | Online Article Text |
id | pubmed-8253954 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Research Network of Computational and Structural Biotechnology |
record_format | MEDLINE/PubMed |
spelling | pubmed-8253954 2021-07-12 Extended many-item similarity indices for sets of nucleotide and protein sequences Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly Comput Struct Biotechnol J Research Article Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons. Research Network of Computational and Structural Biotechnology 2021-06-16 /pmc/articles/PMC8253954/ /pubmed/34257841 http://dx.doi.org/10.1016/j.csbj.2021.06.021 Text en © 2021 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Research Article Bajusz, Dávid; Miranda-Quintana, Ramón Alain; Rácz, Anita; Héberger, Károly Extended many-item similarity indices for sets of nucleotide and protein sequences |
title | Extended many-item similarity indices for sets of nucleotide and protein sequences |
title_sort | extended many-item similarity indices for sets of nucleotide and protein sequences |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8253954/ https://www.ncbi.nlm.nih.gov/pubmed/34257841 http://dx.doi.org/10.1016/j.csbj.2021.06.021 |