
Learning supervised embeddings for large scale sequence comparisons



Bibliographic Details
Main authors: Kimothi, Dhananjay; Biyani, Pravesh; Hogan, James M.; Soni, Akshay; Kelly, Wayne
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7069636/
https://www.ncbi.nlm.nih.gov/pubmed/32168338
http://dx.doi.org/10.1371/journal.pone.0216636
_version_ 1783505816003805184
author Kimothi, Dhananjay
Biyani, Pravesh
Hogan, James M.
Soni, Akshay
Kelly, Wayne
author_sort Kimothi, Dhananjay
collection PubMed
description Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives that exploit the underlying lexical structure of each sequence. In this paper, we introduce two supervised approaches, SuperVec and SuperVecX, to learn sequence embeddings. These methods extend earlier Representation Learning (RepL) based methods to include class-related information for each sequence during training. Including class information ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. We show the quality of the embeddings learned through these methods on (i) sequence retrieval and (ii) classification tasks. We also propose a hierarchical tree-based approach specifically designed for the sequence retrieval problem. The resulting methods, which we term H-SuperVec or H-SuperVecX according to their respective use of SuperVec or SuperVecX, learn embeddings across a range of feature spaces based on exclusive and exhaustive subsets of the class labels. Experiments show that the proposed methods outperform existing (unsupervised) RepL-based approaches on retrieval and classification tasks. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches that rapidly filter the collection so that only potentially relevant records remain. Such filtering of the original database allows slower but more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.
format Online
Article
Text
id pubmed-7069636
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7069636 2020-03-23 Learning supervised embeddings for large scale sequence comparisons Kimothi, Dhananjay Biyani, Pravesh Hogan, James M. Soni, Akshay Kelly, Wayne PLoS One Research Article Public Library of Science 2020-03-13 /pmc/articles/PMC7069636/ /pubmed/32168338 http://dx.doi.org/10.1371/journal.pone.0216636 Text en © 2020 Kimothi et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Learning supervised embeddings for large scale sequence comparisons
topic Research Article