
Quantitative analysis of visual codewords of a protein distance matrix

3D protein structures can be analyzed using a distance matrix calculated as the pairwise distance between all Cα atoms in the protein model. Although researchers have efficiently used distance matrices to classify proteins and find homologous proteins, much less work has been done on quantitative an...
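The distance matrix described above can be illustrated with a minimal sketch. The coordinates below are hypothetical values, not taken from any real structure, and the linear rescaling to 8-bit gray levels is an assumption; the record does not state how the authors mapped distances to pixel intensities.

```python
import math


def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def to_grayscale(matrix):
    """Linearly rescale distances to integer gray levels 0..255.

    The linear mapping is an assumption made for illustration only.
    """
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [[0 for _ in row] for row in matrix]
    return [[round(255 * d / largest) for d in row] for row in matrix]


# Hypothetical Cα coordinates (in ångströms) for a four-residue toy model.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]

dm = distance_matrix(ca_coords)
gray = to_grayscale(dm)
```

In this toy model the main diagonal of `dm` is zero and the matrix is symmetric, so the resulting image is dark along the diagonal and brightest where residues are farthest apart.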


Bibliographic Details
Main Authors:	Pražnikar, Jure, Attygalle, Nuwan Tharanga
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2022
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8815937/ https://www.ncbi.nlm.nih.gov/pubmed/35120181 http://dx.doi.org/10.1371/journal.pone.0263566

collection	PubMed
description	3D protein structures can be analyzed using a distance matrix calculated as the pairwise distances between all Cα atoms in the protein model. Although researchers have used distance matrices efficiently to classify proteins and find homologous proteins, much less work has been done on quantitative analysis of distance matrix features. Therefore, the distance matrix was analyzed as a grayscale image using the KAZE feature extractor algorithm with a Bag of Visual Words model. In this study, each protein was represented as a histogram of visual codewords. The analysis showed that a very small number of codewords (~1%) have a high relative frequency (> 0.25) and that the majority of codewords have a relative frequency around 0.05. We have also shown that there is a relationship between the frequency of codewords and the position of the features in a distance matrix. The codewords that are more frequent are located closer to the main diagonal. Less frequent codewords, on the other hand, are located in the corners of the distance matrix, far from the main diagonal. Moreover, the analysis showed a correlation between the number of unique codewords and the 3D repeats in the protein structure. The solenoid and tandem-repeat proteins have a significantly lower number of unique codewords than the globular proteins. Finally, codeword histograms and a Support Vector Machine (SVM) classifier were used to classify solenoid and globular proteins. The results showed that the SVM classifier fed with codeword histograms correctly classified 352 out of 354 proteins.
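The step of representing a protein as a histogram of visual codewords can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the descriptors and codebook are tiny hypothetical 2-D vectors (real KAZE descriptors are much higher-dimensional), the codebook is assumed to exist already (it is commonly built by clustering descriptors, which the record does not confirm), and the histogram is normalized by the number of features in one protein, which may differ from how the study defines relative frequency. KAZE extraction and the SVM step are not shown.

```python
import math
from collections import Counter


def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: math.dist(descriptor, codebook[i]),
    )


def codeword_histogram(descriptors, codebook):
    """Fraction of descriptors assigned to each codeword."""
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[i] / total if total else 0.0 for i in range(len(codebook))]


# Hypothetical codebook of three codewords and four feature descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.0, 0.2)]

hist = codeword_histogram(descriptors, codebook)

# Number of distinct codewords that occur at least once in this protein.
unique_codewords = sum(1 for h in hist if h > 0)
```

A vector such as `hist` is the kind of fixed-length input a classifier like an SVM can consume, and `unique_codewords` corresponds to the count the abstract compares between repeat and globular proteins.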
id	pubmed-8815937
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-8815937 2022-02-05 Quantitative analysis of visual codewords of a protein distance matrix Pražnikar, Jure Attygalle, Nuwan Tharanga PLoS One Research Article
Public Library of Science 2022-02-04 /pmc/articles/PMC8815937/ /pubmed/35120181 http://dx.doi.org/10.1371/journal.pone.0263566 Text en © 2022 Pražnikar, Attygalle https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Quantitative analysis of visual codewords of a protein distance matrix