Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one funct...

Bibliographic Details
Main Authors:	Ma, Qicheng; Chirn, Gung-Wei; Cai, Richard; Szustakowski, Joseph D; Nirmala, NR
Format:	Text
Language:	English
Published:	BioMed Central 2005
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/ https://www.ncbi.nlm.nih.gov/pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242

author	Ma, Qicheng; Chirn, Gung-Wei; Cai, Richard; Szustakowski, Joseph D; Nirmala, NR
author_sort	Ma, Qicheng
collection	PubMed
description	BACKGROUND: The sequencing of the human genome has given us access to a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30,000 known and predicted human coding genes are characterized and have been assigned at least one function, there remain a fair number of genes (about 12,000) for which no annotation has been made. The recent sequencing of other genomes has provided a large amount of auxiliary sequence data that could help in characterizing the human genes. Clustering these sequences into families is one of the first steps in performing comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes, using a novel distance metric that is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between the domain structures of the two sequences. The distance metric is the probability that a pair of protein sequences belong to the same InterPro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes.
format	Text
id	pubmed-1261163
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1261163 2005-10-26 BMC Bioinformatics Methodology Article
BioMed Central 2005-10-03 /pmc/articles/PMC1261163/ /pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242 Text en Copyright © 2005 Ma et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/ https://www.ncbi.nlm.nih.gov/pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242
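
The clustering step named in the abstract (hierarchical average clustering over a distance derived from a same-family probability) can be illustrated with a minimal sketch. This is not the CLUGEN implementation: the function name `average_linkage`, the `1 - probability` distance, the `threshold` cut-off, and the toy scores are all assumptions made for illustration, and the probabilities stand in for the neural network score the article describes.

```python
from itertools import combinations


def average_linkage(names, prob, threshold=0.5):
    """Agglomerative average-linkage clustering (illustrative sketch).

    names: list of sequence identifiers.
    prob: dict mapping frozenset({a, b}) to an assumed probability that
          a and b share a family (stand-in for a neural network score).
    threshold: hypothetical cut-off; merging stops once the closest pair
               of clusters is farther apart than this distance.
    """
    def dist(a, b):
        # Assumed distance: 1 - probability; unscored pairs count as unrelated.
        return 1.0 - prob.get(frozenset((a, b)), 0.0)

    def cluster_dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    while len(clusters) > 1:
        # Find the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        if cluster_dist(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(tuple(sorted(c)) for c in clusters)


# Toy scores invented for illustration; they are not taken from the article.
toy_prob = {
    frozenset(("A", "B")): 0.95,
    frozenset(("B", "C")): 0.90,
    frozenset(("A", "C")): 0.40,
    frozenset(("D", "E")): 0.85,
}
families = average_linkage(["A", "B", "C", "D", "E"], toy_prob, threshold=0.5)
print(families)
```

In this toy input, A and C score weakly against each other but both score strongly against B, so averaging over cluster members places all three in one group. That is one simple way a probability-based distance can link sequences through an intermediate, in the spirit of the transitive homology the abstract mentions.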
