A graph-based clustering method applied to protein sequences

The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous prote...

Bibliographic Details
Main authors: Mishra, Pooja, Pandey, Paras Nath
Format: Online Article Text
Language: English
Published: Biomedical Informatics 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163914/
https://www.ncbi.nlm.nih.gov/pubmed/21927545
author Mishra, Pooja
Pandey, Paras Nath
author_facet Mishra, Pooja
Pandey, Paras Nath
author_sort Mishra, Pooja
collection PubMed
description The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks grouping of amino acid sequences into subsets based on distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
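The description states that clusters correspond to connected subgraphs of a similarity graph. The sketch below illustrates that general idea only; it is not the authors' algorithm. It assumes the simplest possible reading: two sequences are joined by an edge when their pairwise score reaches a cutoff, and each connected component is one cluster. The function names, the toy scores, and the `threshold` parameter are all hypothetical and are not taken from the article.

```python
from itertools import combinations


def cluster_by_similarity(ids, similarity, threshold):
    """Return connected components of the thresholded similarity graph.

    Assumption (not from the article): an edge joins two sequences when
    similarity(a, b) >= threshold. Components are found with union-find.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if similarity(a, b) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical pairwise scores for six made-up sequence identifiers.
scores = {("p1", "p2"): 0.9, ("p2", "p3"): 0.7, ("p4", "p5"): 0.8}


def toy_similarity(a, b):
    # Symmetric lookup; unlisted pairs are treated as dissimilar.
    return scores.get((a, b), scores.get((b, a), 0.0))


ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
print(cluster_by_similarity(ids, toy_similarity, 0.5))
```

In this toy run, p3 joins the first cluster only through p2, and p6 remains a singleton. Plain connected components favor separation over homogeneity, since a chain of moderately similar pairs can merge distant sequences, so a method that aims at both criteria would need more than this threshold rule.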
format Online
Article
Text
id pubmed-3163914
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Biomedical Informatics
record_format MEDLINE/PubMed
spelling pubmed-31639142011-09-16 A graph-based clustering method applied to protein sequences Mishra, Pooja Pandey, Paras Nath Bioinformation Hypothesis The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks grouping of amino acid sequences into subsets based on distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. Biomedical Informatics 2011-08-02 /pmc/articles/PMC3163914/ /pubmed/21927545 Text en © 2011 Biomedical Informatics This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.
spellingShingle Hypothesis
Mishra, Pooja
Pandey, Paras Nath
A graph-based clustering method applied to protein sequences
title A graph-based clustering method applied to protein sequences
topic Hypothesis