
Quantitative analysis of visual codewords of a protein distance matrix

3D protein structures can be analyzed using a distance matrix calculated as the pairwise distance between all Cα atoms in the protein model. Although researchers have efficiently used distance matrices to classify proteins and find homologous proteins, much less work has been done on quantitative an...
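The distance matrix described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy coordinates, and the linear rescaling to 8-bit gray levels are all assumptions made for the example.

```python
from math import dist


def ca_distance_matrix(coords):
    """Return the N x N matrix of Euclidean distances between C-alpha atoms.

    `coords` is a sequence of (x, y, z) tuples, one per residue.
    """
    return [[dist(a, b) for b in coords] for a in coords]


def to_grayscale(matrix):
    """Rescale distances linearly to integers in 0..255.

    Assumption: the record does not say how distances are mapped to
    gray levels, so a simple max-normalization is used here.
    """
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [[0 for _ in row] for row in matrix]
    return [[round(255 * d / peak) for d in row] for row in matrix]


# Hypothetical coordinates for three residues (angstroms).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
toy_dist = ca_distance_matrix(toy_coords)
toy_image = to_grayscale(toy_dist)
```

The matrix is symmetric with a zero main diagonal, so residues that are close in sequence sit near the diagonal of the resulting image.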


Bibliographic Details
Main authors: Pražnikar, Jure; Attygalle, Nuwan Tharanga
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8815937/
https://www.ncbi.nlm.nih.gov/pubmed/35120181
http://dx.doi.org/10.1371/journal.pone.0263566
author Pražnikar, Jure
Attygalle, Nuwan Tharanga
collection PubMed
description 3D protein structures can be analyzed using a distance matrix calculated as the pairwise distance between all Cα atoms in the protein model. Although researchers have used distance matrices efficiently to classify proteins and find homologous proteins, much less work has been done on quantitative analysis of distance matrix features. Therefore, the distance matrix was analyzed as a grayscale image using the KAZE feature extraction algorithm with the Bag of Visual Words model. In this study, each protein was represented as a histogram of visual codewords. The analysis showed that a very small number of codewords (~1%) have a high relative frequency (> 0.25) and that the majority of codewords have a relative frequency of around 0.05. We have also shown that there is a relationship between the frequency of codewords and the position of the features in a distance matrix. The more frequent codewords are located closer to the main diagonal. Less frequent codewords, on the other hand, are located in the corners of the distance matrix, far from the main diagonal. Moreover, the analysis showed a correlation between the number of unique codewords and the 3D repeats in the protein structure. Solenoid and tandem-repeat proteins have a significantly lower number of unique codewords than globular proteins. Finally, the codeword histograms and a Support Vector Machine (SVM) classifier were used to classify solenoid and globular proteins. The results showed that the SVM classifier fed with codeword histograms correctly classified 352 out of 354 proteins.
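The histogram-of-codewords representation mentioned in the abstract can be illustrated with a small sketch. It assumes that local feature descriptors (for example from KAZE) and a codebook of cluster centres are already available; the 2-D toy vectors, the function names, and the nearest-centre assignment by Euclidean distance are assumptions for illustration, not details taken from the record.

```python
from collections import Counter
from math import dist


def assign_codewords(descriptors, codebook):
    """Map each descriptor to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        for d in descriptors
    ]


def codeword_histogram(descriptors, codebook):
    """Count how many descriptors fall on each codeword (one bin per codeword)."""
    counts = Counter(assign_codewords(descriptors, codebook))
    return [counts.get(k, 0) for k in range(len(codebook))]


# Hypothetical 2-D stand-ins for real descriptors and cluster centres.
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 0.2)]

toy_hist = codeword_histogram(toy_descriptors, toy_codebook)
# Number of distinct codewords that occur at least once in this "protein".
toy_unique = sum(1 for c in toy_hist if c > 0)
```

In this toy case two of the three codewords are used. The abstract reports that repeat proteins use fewer unique codewords than globular ones, which is the kind of quantity `toy_unique` stands in for. A histogram like `toy_hist` is also the sort of fixed-length vector that could be passed to a classifier such as an SVM.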
format Online
Article
Text
id pubmed-8815937
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8815937 2022-02-05 PLoS One Research Article
Public Library of Science 2022-02-04 /pmc/articles/PMC8815937/ /pubmed/35120181 http://dx.doi.org/10.1371/journal.pone.0263566 Text en © 2022 Pražnikar, Attygalle https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article