RAFTS(3)G: an efficient and versatile clustering software to analyses in large protein datasets

BACKGROUND: Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms that solve some Data Mining difficulties but can degrade biologically relevant in...

Bibliographic Details
Main authors: de Lima Nichio, Bruno Thiago, de Oliveira, Aryel Marlus Repula, de Pierri, Camilla Reginatto, Santos, Leticia Graziela Costa, Lejambre, Alexandre Quadros, Vialle, Ricardo Assunção, da Rocha Coimbra, Nilson Antônio, Guizelini, Dieval, Marchaukoski, Jeroniza Nunes, de Oliveira Pedrosa, Fabio, Raittz, Roberto Tadeu
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6631606/
https://www.ncbi.nlm.nih.gov/pubmed/31307371
http://dx.doi.org/10.1186/s12859-019-2973-4
_version_ 1783435556782342144
author de Lima Nichio, Bruno Thiago
de Oliveira, Aryel Marlus Repula
de Pierri, Camilla Reginatto
Santos, Leticia Graziela Costa
Lejambre, Alexandre Quadros
Vialle, Ricardo Assunção
da Rocha Coimbra, Nilson Antônio
Guizelini, Dieval
Marchaukoski, Jeroniza Nunes
de Oliveira Pedrosa, Fabio
Raittz, Roberto Tadeu
author_sort de Lima Nichio, Bruno Thiago
collection PubMed
description BACKGROUND: Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms that solve some Data Mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent databases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential. RESULTS: Here we present a new approach to Data Mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS(3)G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity in the generation of clusters across a wide range of similarity. RAFTS(3)G is the better choice compared to three main methods when the user wants a more reliable result, even without knowing the ideal clustering threshold. CONCLUSION: In general, RAFTS(3)G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared to other “gold-standard” methods for clustering large biological data, RAFTS(3)G maintains the balance between reducing biological information redundancy and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and reduces the time needed to generate data with the RAFTS(3)G process.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-2973-4) contains supplementary material, which is available to authorized users.
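As a rough illustration of the threshold-based greedy clustering that the abstract describes as the usual approach of existing tools, the Python sketch below groups sequences by k-mer Jaccard similarity without computing alignments. It is not the RAFTS(3)G algorithm. The function names, the k-mer size, and the choice of Jaccard similarity are assumptions made only for this example.

```python
from typing import Dict, List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs: Dict[str, str], threshold: float, k: int = 3) -> List[List[str]]:
    """Hypothetical greedy clustering, for illustration only.

    Sequences are visited longest first (ties broken by name). Each one
    joins the first cluster whose representative is at least `threshold`
    similar; otherwise it becomes the representative of a new cluster.
    """
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    reps: List[Set[str]] = []        # k-mer set of each cluster representative
    clusters: List[List[str]] = []   # sequence names per cluster
    for name in order:
        kmers = kmer_set(seqs[name], k)
        for idx, rep in enumerate(reps):
            if jaccard(kmers, rep) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No representative was similar enough: open a new cluster.
            reps.append(kmers)
            clusters.append([name])
    return clusters
```

Running the same input with different values of `threshold` changes how many clusters are produced, which is one way to see why the choice of threshold matters for greedy methods of this kind.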
format Online
Article
Text
id pubmed-6631606
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6631606 2019-07-24 BMC Bioinformatics Software BioMed Central 2019-07-15 /pmc/articles/PMC6631606/ /pubmed/31307371 http://dx.doi.org/10.1186/s12859-019-2973-4 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title RAFTS(3)G: an efficient and versatile clustering software to analyses in large protein datasets
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6631606/
https://www.ncbi.nlm.nih.gov/pubmed/31307371
http://dx.doi.org/10.1186/s12859-019-2973-4