Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences
DNA sequencers output a large set of very long biological data strings that we should persist in databases rather than basic text file systems. Many different data models and database management systems (DBMS) may deal with both storage and efficiency issues regarding genomic datasets. Specifically,...
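Main authors: | Lifschitz, Sergio; Haeusler, Edward H.; Catanho, Marcos; de Miranda, Antonio B.; Molina de Armas, Elvismary; Heine, Alexandre; Moreira, Sergio G. M. P.; Tristão, Cristian |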
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472027/ https://www.ncbi.nlm.nih.gov/pubmed/35997339 http://dx.doi.org/10.3390/biotech11030031 |
_version_ | 1784789216501170176 |
---|---|
author | Lifschitz, Sergio; Haeusler, Edward H.; Catanho, Marcos; de Miranda, Antonio B.; Molina de Armas, Elvismary; Heine, Alexandre; Moreira, Sergio G. M. P.; Tristão, Cristian |
author_sort | Lifschitz, Sergio |
collection | PubMed |
description | DNA sequencers output a large set of very long biological data strings that we should persist in databases rather than basic text file systems. Many different data models and database management systems (DBMS) may deal with both storage and efficiency issues regarding genomic datasets. Specifically, there is a need for handling strings with variable sizes while keeping their biological meaning. Relational database management systems (RDBMS) provide several data types that could be further explored for the genomics context. Besides, they enforce integrity, consistency, and enable good abstractions for more conventional data. We propose the relational text data type to represent and manipulate biological sequences and their derivatives. We present a logical schema for representing the core biological information, which may be inferred from a given biological conceptual data schema and the corresponding function manipulations. We implement and evaluate these stored functions into an actual RDBMS for both efficacy and efficiency. We show that it is possible to enforce basic and complex requirements for the genomic domain. We claim that the well-established relational text data type in RDBMS may appropriately handle the representation and persistency of biological sequences. We base our approach on the idea of domain-specific abstract data types that can store data with semantically defined functions while hiding those details from non-technical end-users. |
format | Online Article Text |
id | pubmed-9472027 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9472027 2022-09-15 Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences. Lifschitz, Sergio; Haeusler, Edward H.; Catanho, Marcos; de Miranda, Antonio B.; Molina de Armas, Elvismary; Heine, Alexandre; Moreira, Sergio G. M. P.; Tristão, Cristian. BioTech (Basel). Article. MDPI 2022-07-30 /pmc/articles/PMC9472027/ /pubmed/35997339 http://dx.doi.org/10.3390/biotech11030031 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
title | Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9472027/ https://www.ncbi.nlm.nih.gov/pubmed/35997339 http://dx.doi.org/10.3390/biotech11030031 |