Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences

Bibliographic Details
Main authors: Lifschitz, Sergio, Haeusler, Edward H., Catanho, Marcos, de Miranda, Antonio B., Molina de Armas, Elvismary, Heine, Alexandre, Moreira, Sergio G. M. P., Tristão, Cristian
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472027/
https://www.ncbi.nlm.nih.gov/pubmed/35997339
http://dx.doi.org/10.3390/biotech11030031
author Lifschitz, Sergio
Haeusler, Edward H.
Catanho, Marcos
de Miranda, Antonio B.
Molina de Armas, Elvismary
Heine, Alexandre
Moreira, Sergio G. M. P.
Tristão, Cristian
collection PubMed
description DNA sequencers output a large set of very long biological data strings that we should persist in databases rather than in plain text files. Many different data models and database management systems (DBMS) can address both the storage and the efficiency issues of genomic datasets. Specifically, there is a need to handle strings of variable size while preserving their biological meaning. Relational database management systems (RDBMS) provide several data types that could be further explored in the genomics context. In addition, they enforce integrity and consistency and enable good abstractions for more conventional data. We propose the relational text data type to represent and manipulate biological sequences and their derivatives. We present a logical schema for representing the core biological information, which may be inferred from a given biological conceptual data schema, and the corresponding manipulation functions. We implement these stored functions in an actual RDBMS and evaluate them for both efficacy and efficiency. We show that it is possible to enforce basic and complex requirements for the genomic domain. We claim that the well-established relational text data type in RDBMS can appropriately handle the representation and persistence of biological sequences. We base our approach on the idea of domain-specific abstract data types that can store data with semantically defined functions while hiding those details from non-technical end users.
format Online
Article
Text
id pubmed-9472027
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9472027 2022-09-15 Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences BioTech (Basel) Article MDPI 2022-07-30 /pmc/articles/PMC9472027/ /pubmed/35997339 http://dx.doi.org/10.3390/biotech11030031 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences
topic Article