Massive fungal biodiversity data re-annotation with multi-level clustering
With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identificati...
Main authors: | Vu, Duong; Szöke, Szániszló; Wiwie, Christian; Baumbach, Jan; Cardinali, Gianluigi; Röttger, Richard; Robert, Vincent |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group, 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213798/ https://www.ncbi.nlm.nih.gov/pubmed/25355642 http://dx.doi.org/10.1038/srep06837 |
_version_ | 1782341868854444032 |
---|---|
author | Vu, Duong; Szöke, Szániszló; Wiwie, Christian; Baumbach, Jan; Cardinali, Gianluigi; Röttger, Richard; Robert, Vincent |
author_facet | Vu, Duong; Szöke, Szániszló; Wiwie, Christian; Baumbach, Jan; Cardinali, Gianluigi; Röttger, Richard; Robert, Vincent |
author_sort | Vu, Duong |
collection | PubMed |
description | With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours. |
format | Online Article Text |
id | pubmed-4213798 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-42137982014-10-31 Massive fungal biodiversity data re-annotation with multi-level clustering Vu, Duong Szöke, Szániszló Wiwie, Christian Baumbach, Jan Cardinali, Gianluigi Röttger, Richard Robert, Vincent Sci Rep Article With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours. Nature Publishing Group 2014-10-30 /pmc/articles/PMC4213798/ /pubmed/25355642 http://dx.doi.org/10.1038/srep06837 Text en Copyright © 2014, Macmillan Publishers Limited. All rights reserved http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License. 
The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
spellingShingle | Article Vu, Duong Szöke, Szániszló Wiwie, Christian Baumbach, Jan Cardinali, Gianluigi Röttger, Richard Robert, Vincent Massive fungal biodiversity data re-annotation with multi-level clustering |
title | Massive fungal biodiversity data re-annotation with multi-level clustering |
title_full | Massive fungal biodiversity data re-annotation with multi-level clustering |
title_fullStr | Massive fungal biodiversity data re-annotation with multi-level clustering |
title_full_unstemmed | Massive fungal biodiversity data re-annotation with multi-level clustering |
title_short | Massive fungal biodiversity data re-annotation with multi-level clustering |
title_sort | massive fungal biodiversity data re-annotation with multi-level clustering |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213798/ https://www.ncbi.nlm.nih.gov/pubmed/25355642 http://dx.doi.org/10.1038/srep06837 |
work_keys_str_mv | AT vuduong massivefungalbiodiversitydatareannotationwithmultilevelclustering AT szokeszaniszlo massivefungalbiodiversitydatareannotationwithmultilevelclustering AT wiwiechristian massivefungalbiodiversitydatareannotationwithmultilevelclustering AT baumbachjan massivefungalbiodiversitydatareannotationwithmultilevelclustering AT cardinaligianluigi massivefungalbiodiversitydatareannotationwithmultilevelclustering AT rottgerrichard massivefungalbiodiversitydatareannotationwithmultilevelclustering AT robertvincent massivefungalbiodiversitydatareannotationwithmultilevelclustering |
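
The description field states that a full pairwise similarity matrix over hundreds of thousands of sequences would exceed available memory, and that MLC finished in 22 CPU-hours against up to 242 for greedy clustering. The following back-of-envelope sketch makes those figures concrete. The sequence count and CPU-hours come from the record; the 4 bytes per matrix entry (a single-precision float) is an assumption made here for illustration, and the function names are hypothetical. It does not implement MLC itself.

```python
# Back-of-envelope check of the memory and runtime claims in the description.
# Figures taken from the record: 344,239 sequences, 22 vs. 242 CPU-hours.
# Assumption (not from the record): each similarity value takes 4 bytes.

N_SEQUENCES = 344_239
MLC_CPU_HOURS = 22
GREEDY_CPU_HOURS = 242


def pair_count(n: int) -> int:
    """Number of distinct unordered sequence pairs among n sequences."""
    return n * (n - 1) // 2


def full_matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Memory needed to store a dense n x n similarity matrix."""
    return n * n * bytes_per_entry


if __name__ == "__main__":
    pairs = pair_count(N_SEQUENCES)
    dense = full_matrix_bytes(N_SEQUENCES)
    print(f"pairwise comparisons: {pairs:,}")
    print(f"dense matrix size:    {dense / 1e9:.0f} GB (assuming 4-byte entries)")
    print(f"runtime ratio:        {GREEDY_CPU_HOURS / MLC_CPU_HOURS:.1f}x")
```

Under the 4-byte assumption the dense matrix comes to roughly 474 GB, far beyond the memory of a normal desktop computer, which is consistent with the description's argument for avoiding most sequence comparisons.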