A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences
BACKGROUND: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological s...
Main authors: | Russell, David J; Way, Samuel F; Benson, Andrew K; Sayood, Khalid |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3022630/ https://www.ncbi.nlm.nih.gov/pubmed/21167044 http://dx.doi.org/10.1186/1471-2105-11-601 |
_version_ | 1782196537189728256 |
---|---|
author | Russell, David J Way, Samuel F Benson, Andrew K Sayood, Khalid |
author_facet | Russell, David J Way, Samuel F Benson, Andrew K Sayood, Khalid |
author_sort | Russell, David J |
collection | PubMed |
description | BACKGROUND: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. RESULTS: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. CONCLUSIONS: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences. |
format | Text |
id | pubmed-3022630 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-30226302011-01-20 A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences Russell, David J Way, Samuel F Benson, Andrew K Sayood, Khalid BMC Bioinformatics Methodology Article BACKGROUND: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. RESULTS: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. CONCLUSIONS: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences. BioMed Central 2010-12-17 /pmc/articles/PMC3022630/ /pubmed/21167044 http://dx.doi.org/10.1186/1471-2105-11-601 Text en Copyright ©2010 Russell et al; licensee BioMed Central Ltd. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Russell, David J Way, Samuel F Benson, Andrew K Sayood, Khalid A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title_full | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title_fullStr | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title_full_unstemmed | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title_short | A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences |
title_sort | grammar-based distance metric enables fast and accurate clustering of large sets of 16s sequences |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3022630/ https://www.ncbi.nlm.nih.gov/pubmed/21167044 http://dx.doi.org/10.1186/1471-2105-11-601 |
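
The abstract describes a clustering loop in which each new sequence is compared with cluster-representative sequences, and a new cluster is created when no suitable cluster is found. A minimal Python sketch of that loop follows. It is an illustration under stated assumptions, not the authors' implementation: the record does not define the grammar-based distance, so `kmer_distance` (a Jaccard distance over k-mer sets) is a plainly swapped-in placeholder, and the function names, the `threshold` parameter, and the choice of the first member as cluster representative are all hypothetical.

```python
from typing import Callable, List, Set


def kmer_set(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 4) -> float:
    """Jaccard distance between k-mer sets.

    Placeholder only: this is NOT the grammar-based metric of the paper.
    """
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def greedy_cluster(
    seqs: List[str],
    threshold: float,
    distance: Callable[[str, str], float] = kmer_distance,
) -> List[List[int]]:
    """Assign each sequence to the closest representative within threshold.

    Assumption: the first sequence placed in a cluster serves as its
    representative. Returns clusters as lists of input indices.
    """
    representatives: List[str] = []
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        best, best_d = None, float("inf")
        for c, rep in enumerate(representatives):
            d = distance(seq, rep)
            if d <= threshold and d < best_d:
                best, best_d = c, d
        if best is None:
            # No suitable cluster: this sequence founds a new one.
            representatives.append(seq)
            clusters.append([idx])
        else:
            clusters[best].append(idx)
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "TTTTGGGGCCCA"]
    print(greedy_cluster(demo, threshold=0.5))
```

Passing a different callable as `distance` is the intended extension point, so a grammar-based metric could be substituted without changing the loop.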