Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks
BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one funct...
Main authors: | Ma, Qicheng; Chirn, Gung-Wei; Cai, Richard; Szustakowski, Joseph D; Nirmala, NR |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2005 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/ https://www.ncbi.nlm.nih.gov/pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242 |
_version_ | 1782125867242094592 |
---|---|
author | Ma, Qicheng Chirn, Gung-Wei Cai, Richard Szustakowski, Joseph D Nirmala, NR |
author_sort | Ma, Qicheng |
collection | PubMed |
description | BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same InterPro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes. |
format | Text |
id | pubmed-1261163 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2005 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1261163 2005-10-26 Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks Ma, Qicheng Chirn, Gung-Wei Cai, Richard Szustakowski, Joseph D Nirmala, NR BMC Bioinformatics Methodology Article BioMed Central 2005-10-03 /pmc/articles/PMC1261163/ /pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242 Text en Copyright © 2005 Ma et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Ma, Qicheng Chirn, Gung-Wei Cai, Richard Szustakowski, Joseph D Nirmala, NR Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks |
title | Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1261163/ https://www.ncbi.nlm.nih.gov/pubmed/16202129 http://dx.doi.org/10.1186/1471-2105-6-242 |
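
The description in this record mentions hierarchical average clustering applied to a pairwise same-family probability. As a rough illustration only, the sketch below shows a plain average-linkage procedure in Python. It is not the CLUGEN implementation: the function name `average_linkage`, the conversion distance = 1 - probability, the treatment of missing pairs as probability 0, the cutoff value, and the toy probabilities are all assumptions made for this example.

```python
from itertools import combinations


def average_linkage(names, prob, cutoff):
    """Naive agglomerative average-linkage clustering (illustrative sketch).

    names  : list of sequence identifiers
    prob   : dict mapping frozenset({a, b}) -> same-family probability in [0, 1]
    cutoff : merging stops once the closest pair of clusters is farther than this
    """
    # Assumption: distance = 1 - probability; unlisted pairs get probability 0.
    def dist(a, b):
        return 1.0 - prob.get(frozenset((a, b)), 0.0)

    def cluster_dist(c1, c2):
        # Average linkage: mean of all member-to-member distances.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), d = min(
            (((i, j), cluster_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up probabilities for four hypothetical sequences.
toy_prob = {
    frozenset(("seqA", "seqB")): 0.95,
    frozenset(("seqB", "seqC")): 0.90,
    frozenset(("seqA", "seqC")): 0.40,
    frozenset(("seqC", "seqD")): 0.05,
}
families = average_linkage(["seqA", "seqB", "seqC", "seqD"], toy_prob, cutoff=0.5)
print(families)
```

In this toy input, seqA and seqC end up in the same cluster even though their direct probability is only 0.40, because both are close to seqB. That is one simple way to picture how linking through an intermediate sequence can group remote homologues, under the assumptions stated above.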