
SWeeP: representing large biological sequences datasets in compact vectors

Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here,...


Bibliographic Details
Main authors: De Pierri, Camilla Reginatto, Voyceik, Ricardo, Santos de Mattos, Letícia Graziela Costa, Kulik, Mariane Gonçalves, Camargo, Josué Oliveira, Repula de Oliveira, Aryel Marlus, de Lima Nichio, Bruno Thiago, Marchaukoski, Jeroniza Nunes, da Silva Filho, Antonio Camilo, Guizelini, Dieval, Ortega, J. Miguel, Pedrosa, Fabio O., Raittz, Roberto Tadeu
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6952362/
https://www.ncbi.nlm.nih.gov/pubmed/31919449
http://dx.doi.org/10.1038/s41598-019-55627-4
author De Pierri, Camilla Reginatto
Voyceik, Ricardo
Santos de Mattos, Letícia Graziela Costa
Kulik, Mariane Gonçalves
Camargo, Josué Oliveira
Repula de Oliveira, Aryel Marlus
de Lima Nichio, Bruno Thiago
Marchaukoski, Jeroniza Nunes
da Silva Filho, Antonio Camilo
Guizelini, Dieval
Ortega, J. Miguel
Pedrosa, Fabio O.
Raittz, Roberto Tadeu
author_sort De Pierri, Camilla Reginatto
collection PubMed
description Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal basis. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and SWeeP was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
format Online
Article
Text
id pubmed-6952362
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-6952362 2020-01-13 SWeeP: representing large biological sequences datasets in compact vectors De Pierri, Camilla Reginatto Voyceik, Ricardo Santos de Mattos, Letícia Graziela Costa Kulik, Mariane Gonçalves Camargo, Josué Oliveira Repula de Oliveira, Aryel Marlus de Lima Nichio, Bruno Thiago Marchaukoski, Jeroniza Nunes da Silva Filho, Antonio Camilo Guizelini, Dieval Ortega, J. Miguel Pedrosa, Fabio O. Raittz, Roberto Tadeu Sci Rep Article Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal basis. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and SWeeP was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
Nature Publishing Group UK 2020-01-09 /pmc/articles/PMC6952362/ /pubmed/31919449 http://dx.doi.org/10.1038/s41598-019-55627-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title SWeeP: representing large biological sequences datasets in compact vectors
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6952362/
https://www.ncbi.nlm.nih.gov/pubmed/31919449
http://dx.doi.org/10.1038/s41598-019-55627-4