SWeeP: representing large biological sequences datasets in compact vectors
Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here,...
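Main authors: | De Pierri, Camilla Reginatto; Voyceik, Ricardo; Santos de Mattos, Letícia Graziela Costa; Kulik, Mariane Gonçalves; Camargo, Josué Oliveira; Repula de Oliveira, Aryel Marlus; de Lima Nichio, Bruno Thiago; Marchaukoski, Jeroniza Nunes; da Silva Filho, Antonio Camilo; Guizelini, Dieval; Ortega, J. Miguel; Pedrosa, Fabio O.; Raittz, Roberto Tadeu |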
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6952362/ https://www.ncbi.nlm.nih.gov/pubmed/31919449 http://dx.doi.org/10.1038/s41598-019-55627-4 |
_version_ | 1783486433494827008 |
---|---|
author | De Pierri, Camilla Reginatto; Voyceik, Ricardo; Santos de Mattos, Letícia Graziela Costa; Kulik, Mariane Gonçalves; Camargo, Josué Oliveira; Repula de Oliveira, Aryel Marlus; de Lima Nichio, Bruno Thiago; Marchaukoski, Jeroniza Nunes; da Silva Filho, Antonio Camilo; Guizelini, Dieval; Ortega, J. Miguel; Pedrosa, Fabio O.; Raittz, Roberto Tadeu |
author_facet | De Pierri, Camilla Reginatto; Voyceik, Ricardo; Santos de Mattos, Letícia Graziela Costa; Kulik, Mariane Gonçalves; Camargo, Josué Oliveira; Repula de Oliveira, Aryel Marlus; de Lima Nichio, Bruno Thiago; Marchaukoski, Jeroniza Nunes; da Silva Filho, Antonio Camilo; Guizelini, Dieval; Ortega, J. Miguel; Pedrosa, Fabio O.; Raittz, Roberto Tadeu |
author_sort | De Pierri, Camilla Reginatto |
collection | PubMed |
description | Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and SWeeP was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/. |
format | Online Article Text |
id | pubmed-6952362 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-69523622020-01-13 SWeeP: representing large biological sequences datasets in compact vectors De Pierri, Camilla Reginatto Voyceik, Ricardo Santos de Mattos, Letícia Graziela Costa Kulik, Mariane Gonçalves Camargo, Josué Oliveira Repula de Oliveira, Aryel Marlus de Lima Nichio, Bruno Thiago Marchaukoski, Jeroniza Nunes da Silva Filho, Antonio Camilo Guizelini, Dieval Ortega, J. Miguel Pedrosa, Fabio O. Raittz, Roberto Tadeu Sci Rep Article Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and Sweep was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/. 
Nature Publishing Group UK 2020-01-09 /pmc/articles/PMC6952362/ /pubmed/31919449 http://dx.doi.org/10.1038/s41598-019-55627-4 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article De Pierri, Camilla Reginatto Voyceik, Ricardo Santos de Mattos, Letícia Graziela Costa Kulik, Mariane Gonçalves Camargo, Josué Oliveira Repula de Oliveira, Aryel Marlus de Lima Nichio, Bruno Thiago Marchaukoski, Jeroniza Nunes da Silva Filho, Antonio Camilo Guizelini, Dieval Ortega, J. Miguel Pedrosa, Fabio O. Raittz, Roberto Tadeu SWeeP: representing large biological sequences datasets in compact vectors |
title | SWeeP: representing large biological sequences datasets in compact vectors |
title_full | SWeeP: representing large biological sequences datasets in compact vectors |
title_fullStr | SWeeP: representing large biological sequences datasets in compact vectors |
title_full_unstemmed | SWeeP: representing large biological sequences datasets in compact vectors |
title_short | SWeeP: representing large biological sequences datasets in compact vectors |
title_sort | sweep: representing large biological sequences datasets in compact vectors |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6952362/ https://www.ncbi.nlm.nih.gov/pubmed/31919449 http://dx.doi.org/10.1038/s41598-019-55627-4 |
work_keys_str_mv | AT depierricamillareginatto sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT voyceikricardo sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT santosdemattosleticiagrazielacosta sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT kulikmarianegoncalves sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT camargojosueoliveira sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT repuladeoliveiraaryelmarlus sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT delimanichiobrunothiago sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT marchaukoskijeronizanunes sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT dasilvafilhoantoniocamilo sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT guizelinidieval sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT ortegajmiguel sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT pedrosafabioo sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors AT raittzrobertotadeu sweeprepresentinglargebiologicalsequencesdatasetsincompactvectors |
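
Illustrative note: the description field above outlines the method in three steps: scan each sequence with a spaced-word mask, turn every matched word into an index of a high-dimensional vector, then project that vector onto a smaller, randomly oriented orthonormal basis. The Python sketch below is a toy rendering of that outline, not the authors' tool. The mask `"101"`, the use of occurrence counts, the 400-to-8 dimension reduction, the Gram-Schmidt construction, and all function names are assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
CODE = {aa: i for i, aa in enumerate(ALPHABET)}


def spaced_word_vector(seq, mask="101"):
    """Count spaced words; only positions where mask is '1' form the word.

    The mask and the use of counts are illustrative assumptions.
    """
    keep = [i for i, m in enumerate(mask) if m == "1"]
    vec = [0.0] * (len(ALPHABET) ** len(keep))
    for start in range(len(seq) - len(mask) + 1):
        index = 0
        for offset in keep:
            code = CODE.get(seq[start + offset])
            if code is None:  # non-standard residue at a match position
                break
            index = index * len(ALPHABET) + code
        else:
            vec[index] += 1.0  # reached only if no residue was rejected
    return vec


def random_orthonormal_basis(dim, k, seed=0):
    """Build k orthonormal vectors in R^dim (Gram-Schmidt on Gaussian draws)."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove the components along earlier basis vectors
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            basis.append([x / norm for x in v])
    return basis


def project(vec, basis):
    """Coordinates of vec along each basis vector (the compact vector)."""
    return [sum(x * y for x, y in zip(vec, b)) for b in basis]


# Hypothetical usage: 400-dimensional counts reduced to 8 coordinates.
basis = random_orthonormal_basis(len(ALPHABET) ** 2, 8)
compact = project(spaced_word_vector("MKVLAAGIVGLLLA"), basis)
```

Because every sequence is projected with the same seeded basis, the compact vectors stay comparable with one another, for example through Euclidean distance, which is the property the description refers to as preserving intersequence comparability.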