
Massive fungal biodiversity data re-annotation with multi-level clustering

With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identificati...


Bibliographic Details
Main authors: Vu, Duong; Szöke, Szániszló; Wiwie, Christian; Baumbach, Jan; Cardinali, Gianluigi; Röttger, Richard; Robert, Vincent
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4213798/
https://www.ncbi.nlm.nih.gov/pubmed/25355642
http://dx.doi.org/10.1038/srep06837
_version_ 1782341868854444032
author Vu, Duong
Szöke, Szániszló
Wiwie, Christian
Baumbach, Jan
Cardinali, Gianluigi
Röttger, Richard
Robert, Vincent
author_sort Vu, Duong
collection PubMed
description With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches to sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences, because such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed that avoids the majority of sequence comparisons and therefore significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank using only a standard desktop computer within 22 CPU-hours, whereas the greedy clustering method took up to 242 CPU-hours.
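The memory argument in the description can be made concrete with a rough calculation. The sketch below is only an illustration: it assumes one 32-bit float per similarity score, which the record does not state, and the names `pair_count` and `matrix_bytes` are hypothetical helpers, not part of any published MLC implementation.

```python
# Back-of-the-envelope memory estimate for an all-against-all similarity matrix.
N_SEQUENCES = 344_239   # ITS sequences reported in the description
BYTES_PER_SCORE = 4     # assumption: one 32-bit float per similarity score


def pair_count(n: int) -> int:
    """Number of unordered sequence pairs (strict upper triangle of the matrix)."""
    return n * (n - 1) // 2


def matrix_bytes(n: int, bytes_per_score: int = BYTES_PER_SCORE,
                 symmetric: bool = True) -> int:
    """Bytes needed to hold the scores, storing either one triangle or all n*n cells."""
    cells = pair_count(n) if symmetric else n * n
    return cells * bytes_per_score


if __name__ == "__main__":
    for symmetric in (False, True):
        label = "upper triangle" if symmetric else "full matrix"
        gigabytes = matrix_bytes(N_SEQUENCES, symmetric=symmetric) / 1e9
        print(f"{label}: {gigabytes:.0f} GB")
```

Under that assumption the full matrix needs about 474 GB and a single triangle about 237 GB, which is consistent with the statement that the matrix alone would exceed the memory of a desktop computer.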
format Online
Article
Text
id pubmed-4213798
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Nature Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-4213798 2014-10-31 Sci Rep Article Nature Publishing Group 2014-10-30 /pmc/articles/PMC4213798/ /pubmed/25355642 http://dx.doi.org/10.1038/srep06837 Text en Copyright © 2014, Macmillan Publishers Limited. All rights reserved. http://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License.
The images or other third-party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
title Massive fungal biodiversity data re-annotation with multi-level clustering
topic Article