
Indexing labeled sequences

BACKGROUND: Labels add information to a text, for example functional annotations such as genes on a DNA sequence. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and...


Bibliographic Details
Main authors: Rocher, Tatiana, Giraud, Mathieu, Salson, Mikaël
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2018
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924554/
https://www.ncbi.nlm.nih.gov/pubmed/33816803
http://dx.doi.org/10.7717/peerj-cs.148
author Rocher, Tatiana
Giraud, Mathieu
Salson, Mikaël
collection PubMed
description BACKGROUND: Labels add information to a text, for example functional annotations such as genes on a DNA sequence. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and brings insight into the lymphocyte repertoire for onco-hematology or immunology studies. METHODS: We present two indexes for a text with non-overlapping labels. They store the text in a Burrows–Wheeler transform (BWT) and a compressed label sequence in a Wavelet Tree. The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TL(BW)-index). Both indexes need space related to the entropy of the labeled text. RESULTS: These indexes allow efficient text–label queries to count and find labeled patterns. The TL(BW)-index has an overhead on simple label queries but is very efficient on combined pattern–label queries. We implemented the indexes in C++ and compared them against a baseline solution on pseudo-random as well as on V(D)J labeled texts. DISCUSSION: New indexes such as the ones we propose improve the way we index and query labeled texts such as, for instance, lymphocyte repertoires for hematological and immunological studies.
format Online
Article
Text
id pubmed-7924554
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7924554 2021-04-02 Indexing labeled sequences Rocher, Tatiana Giraud, Mathieu Salson, Mikaël PeerJ Comput Sci Bioinformatics PeerJ Inc. 2018-03-26 /pmc/articles/PMC7924554/ /pubmed/33816803 http://dx.doi.org/10.7717/peerj-cs.148 Text en © 2018 Rocher et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed.
For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Indexing labeled sequences
topic Bioinformatics