Genome classification by gene distribution: An overlapping subspace clustering approach

BACKGROUND: Genomes of lower organisms have been observed with a large amount of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes bas...


Bibliographic Details
Main authors: Li, Jason; Halgamuge, Saman K; Tang, Sen-Lin
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2383906/
https://www.ncbi.nlm.nih.gov/pubmed/18430250
http://dx.doi.org/10.1186/1471-2148-8-116
author Li, Jason
Halgamuge, Saman K
Tang, Sen-Lin
collection PubMed
description BACKGROUND: Genomes of lower organisms have been observed with a large amount of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes based on gene order and genome position, which helps to reveal species relationships that may not be apparent from traditional phylogenetic methods. RESULTS: We propose the use of an overlapping subspace clustering algorithm for such genome classification problems. The advantage of subspace clustering over traditional clustering is that it can associate clusters with gene arrangement patterns, preserving genomic information in the clusters produced. Additionally, overlapping capability is desirable for the discovery of multiple conserved patterns within a single genome, such as those acquired from different species via horizontal gene transfers. The proposed method involves a novel strategy to vectorize genomes based on their gene distribution. A number of existing subspace clustering and biclustering algorithms were evaluated to identify the best framework upon which to develop our algorithm; we extended a generic subspace clustering algorithm called HARP to incorporate overlapping capability. The proposed algorithm was assessed and applied on bacteriophage genomes. The phage grouping results are consistent overall with the Phage Proteomic Tree and showed common genomic characteristics among the TP901-like, Sfi21-like and sk1-like phage groups. Among 441 phage genomes, we identified four significantly conserved distribution patterns structured by the terminase, portal, integrase, holin and lysin genes. 
We also observed a subgroup of Sfi21-like phages comprising a distinctive divergent genome organization and identified nine new phage members to the Sfi21-like genus: Staphylococcus 71, phiPVL108, Listeria A118, 2389, Lactobacillus phi AT3, A2, Clostridium phi3626, Geobacillus GBSV1, and Listeria monocytogenes PSA. CONCLUSION: The method described in this paper can assist evolutionary study through objectively classifying genomes based on their resemblance in gene order, gene content and gene positions. The method is suitable for application to genomes with high genetic exchange and various conserved gene arrangement, as demonstrated through our application on phages.
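The abstract describes vectorizing genomes by gene distribution and then looking for clusters that share a gene arrangement pattern in only some dimensions. The sketch below is a minimal illustration of that idea, not the authors' vectorization scheme and not the HARP algorithm. It assumes each genome is represented by the relative position of a few marker genes, and it reports the genes (the "subspace") on which two genomes agree. The function names, the tolerance value and the gene coordinates are all hypothetical.

```python
import math

# Marker genes named in the abstract; using them as vector dimensions is an assumption.
MARKER_GENES = ["terminase", "portal", "integrase", "holin", "lysin"]


def vectorize(gene_starts, genome_length):
    """Map a genome to relative gene positions in [0, 1); NaN marks an absent gene."""
    return [
        gene_starts[g] / genome_length if g in gene_starts else math.nan
        for g in MARKER_GENES
    ]


def shared_subspace(vec_a, vec_b, tol=0.05):
    """Return the genes whose relative positions agree within tol in both genomes."""
    return [
        gene
        for gene, a, b in zip(MARKER_GENES, vec_a, vec_b)
        if not (math.isnan(a) or math.isnan(b)) and abs(a - b) <= tol
    ]


# Hypothetical genomes with made-up coordinates (not real phage data).
phage_x = vectorize(
    {"terminase": 1000, "portal": 3000, "holin": 30000, "lysin": 32000}, 40000
)
phage_y = vectorize(
    {"terminase": 1500, "portal": 4000, "integrase": 20000, "lysin": 10000}, 50000
)

# The two genomes agree on terminase and portal placement but not on lysin.
print(shared_subspace(phage_x, phage_y))
```

Because agreement is tested per gene rather than over the whole vector, one genome could match different partners on different gene subsets, which is the kind of overlap the abstract argues for in genomes shaped by horizontal gene transfer.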
format Text
id pubmed-2383906
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2383906 2008-05-14 Genome classification by gene distribution: An overlapping subspace clustering approach Li, Jason Halgamuge, Saman K Tang, Sen-Lin BMC Evol Biol Methodology Article BioMed Central 2008-04-23 /pmc/articles/PMC2383906/ /pubmed/18430250 http://dx.doi.org/10.1186/1471-2148-8-116 Text en Copyright ©2008 Li et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Genome classification by gene distribution: An overlapping subspace clustering approach
topic Methodology Article