
A Rapid Method for Characterization of Protein Relatedness Using Feature Vectors

We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the dista...


Bibliographic Details
Main Authors: Carr, Kareem; Murray, Eleanor; Armah, Ebenezer; He, Rong L.; Yau, Stephen S.-T.
Format: Text
Language: English
Published: Public Library of Science 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2832692/
https://www.ncbi.nlm.nih.gov/pubmed/20221427
http://dx.doi.org/10.1371/journal.pone.0009550
_version_ 1782178333740498944
author Carr, Kareem
Murray, Eleanor
Armah, Ebenezer
He, Rong L.
Yau, Stephen S.-T.
author_sort Carr, Kareem
collection PubMed
description We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the distance between the actual sequence and a model of a theoretical sequence based on the binomial and uniform distributions. This method is distinctive in that it does not rely on sequence alignment for determining protein relatedness, allowing the user to visualize the relationships within a set of proteins without making a priori assumptions about those proteins. We apply our method to two large families of proteins: protein kinase C, and globins, including hemoglobins and myoglobins. We interpret the high-dimensional feature vectors using principal components analysis and agglomerative hierarchical clustering. We find that the feature vector retains much of the information about the original sequence. By using principal component analysis to extract information from collections of feature vectors, we are able to quickly identify the nature of variation in a collection of proteins. Where collections are phylogenetically or functionally related, this is easily detected. Hierarchical agglomerative clustering provides a means of constructing cladograms from the feature vector output.
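The description above outlines an alignment-free feature vector built from the number and location of residues, compared against binomial and uniform models. The following is only a minimal sketch of that general idea, not the authors' exact construction: the 40-dimensional layout, the assumption of equal residue probabilities (p = 1/20), the normalization choices, and the names `feature_vector` and `euclidean` are all illustrative assumptions that do not come from the record.

```python
import math

# The 20 standard amino acids, in a fixed order (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def feature_vector(seq):
    """Return a 40-dimensional vector for a protein sequence.

    For each amino acid, two illustrative features are emitted:
      1. how far its count lies from a binomial expectation (a z-score,
         assuming every residue is equally likely, p = 1/20), and
      2. how far its mean position lies from the mean of a uniform
         distribution over positions 1..n, scaled by sequence length.
    This layout is an assumption made for illustration only.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")

    p = 1.0 / len(AMINO_ACIDS)
    expected_count = n * p                   # binomial mean
    count_sd = math.sqrt(n * p * (1.0 - p))  # binomial standard deviation
    expected_pos = (n + 1) / 2.0             # mean of uniform on 1..n

    vec = []
    for aa in AMINO_ACIDS:
        positions = [i for i, c in enumerate(seq, start=1) if c == aa]
        count = len(positions)
        vec.append((count - expected_count) / count_sd)
        if positions:
            mean_pos = sum(positions) / count
            vec.append((mean_pos - expected_pos) / n)
        else:
            # Residue absent: no positional information, use a neutral 0.
            vec.append(0.0)
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Sequences of different lengths can be compared without alignment,
# because every sequence maps to a vector of the same dimension.
a = feature_vector("MKTAYIAKQR")
b = feature_vector("MKTAYIAKQRQISFVK")
distance = euclidean(a, b)
```

Vectors of this kind could be stacked into a matrix and passed to principal component analysis or agglomerative hierarchical clustering, which is the kind of downstream analysis the description mentions.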
format Text
id pubmed-2832692
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-28326922010-03-11 A Rapid Method for Characterization of Protein Relatedness Using Feature Vectors Carr, Kareem Murray, Eleanor Armah, Ebenezer He, Rong L. Yau, Stephen S.-T. PLoS One Research Article We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the distance between the actual sequence and a model of a theoretical sequence based on the binomial and uniform distributions. This method is distinctive in that it does not rely on sequence alignment for determining protein relatedness, allowing the user to visualize the relationships within a set of proteins without making a priori assumptions about those proteins. We apply our method to two large families of proteins: protein kinase C, and globins, including hemoglobins and myoglobins. We interpret the high-dimensional feature vectors using principal components analysis and agglomerative hierarchical clustering. We find that the feature vector retains much of the information about the original sequence. By using principal component analysis to extract information from collections of feature vectors, we are able to quickly identify the nature of variation in a collection of proteins. Where collections are phylogenetically or functionally related, this is easily detected. Hierarchical agglomerative clustering provides a means of constructing cladograms from the feature vector output. Public Library of Science 2010-03-05 /pmc/articles/PMC2832692/ /pubmed/20221427 http://dx.doi.org/10.1371/journal.pone.0009550 Text en Carr et al. 
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title A Rapid Method for Characterization of Protein Relatedness Using Feature Vectors
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2832692/
https://www.ncbi.nlm.nih.gov/pubmed/20221427
http://dx.doi.org/10.1371/journal.pone.0009550