iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences

SUMMARY: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly...

Bibliographic Details
Main authors: Millan Arias, Pablo; Hill, Kathleen A; Kari, Lila
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10483029/
https://www.ncbi.nlm.nih.gov/pubmed/37589603
http://dx.doi.org/10.1093/bioinformatics/btad508
author Millan Arias, Pablo
Hill, Kathleen A
Kari, Lila
collection PubMed
description SUMMARY: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of [Formula: see text], and the two specialized algorithms by an average of [Formula: see text], on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabeled DNA sequences. AVAILABILITY AND IMPLEMENTATION: iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.
format Online
Article
Text
id pubmed-10483029
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10483029 2023-09-08 Bioinformatics Applications Note
Oxford University Press 2023-08-17 /pmc/articles/PMC10483029/ /pubmed/37589603 http://dx.doi.org/10.1093/bioinformatics/btad508 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences
topic Applications Note