
Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one funct...


Bibliographic Details
Main Authors: Ma, Qicheng, Chirn, Gung-Wei, Cai, Richard, Szustakowski, Joseph D, Nirmala, NR
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/
https://www.ncbi.nlm.nih.gov/pubmed/16202129
http://dx.doi.org/10.1186/1471-2105-6-242
author Ma, Qicheng
Chirn, Gung-Wei
Cai, Richard
Szustakowski, Joseph D
Nirmala, NR
collection PubMed
description BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature shows that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes.
format Text
id pubmed-1261163
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1261163 2005-10-26 Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks Ma, Qicheng Chirn, Gung-Wei Cai, Richard Szustakowski, Joseph D Nirmala, NR BMC Bioinformatics Methodology Article
BioMed Central 2005-10-03 /pmc/articles/PMC1261163/ /pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242 Text en Copyright © 2005 Ma et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/
https://www.ncbi.nlm.nih.gov/pubmed/16202129
http://dx.doi.org/10.1186/1471-2105-6-242
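The abstract in this record describes hierarchical average clustering applied to a pairwise score that estimates the probability that two proteins belong to the same InterPro family/domain. The following is only a minimal sketch of that general idea, not the CLUGEN implementation: the function name, the toy sequence labels, the probability values, the cutoff, and the conversion "distance = 1 - probability" are all assumptions made for illustration, since the record does not specify them.

```python
from itertools import combinations


def average_linkage(names, prob, cutoff):
    """Agglomerative average-linkage clustering (illustrative sketch).

    names  : list of sequence identifiers (hypothetical labels)
    prob   : dict mapping frozenset({a, b}) to an assumed same-family probability
    cutoff : merging stops once the closest pair of clusters is farther than this
    """
    def dist(a, b):
        # Assumption: distance = 1 - probability; the record gives no mapping.
        return 1.0 - prob[frozenset((a, b))]

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(dist(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy input: made-up probabilities, not data from the article.
names = ["A", "B", "C", "D"]
prob = {
    frozenset(("A", "B")): 0.90,
    frozenset(("C", "D")): 0.80,
    frozenset(("A", "C")): 0.10,
    frozenset(("A", "D")): 0.20,
    frozenset(("B", "C")): 0.15,
    frozenset(("B", "D")): 0.05,
}
families = average_linkage(names, prob, cutoff=0.5)
print(families)  # [['A', 'B'], ['C', 'D']]
```

With a cutoff of 0.5, the two high-probability pairs merge into separate clusters and the loop stops because the average distance between them (0.875) exceeds the cutoff. A real pipeline would obtain the probabilities from a trained model and would likely use an optimized clustering library rather than this quadratic search.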