
Improving reusability along the data life cycle: a regulatory circuits case study

BACKGROUND: In life sciences, there has been a long-standing effort to standardize and integrate reference datasets and databases. Despite these efforts, many studies' data are provided in specific, non-standard formats. This hampers the capacity to reuse the studies' data in other pipe...


Bibliographic Details
Main authors: Louarn, Marine, Chatonnet, Fabrice, Garnier, Xavier, Fest, Thierry, Siegel, Anne, Faron, Catherine, Dameron, Olivier
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8962212/
https://www.ncbi.nlm.nih.gov/pubmed/35346379
http://dx.doi.org/10.1186/s13326-022-00266-4
author Louarn, Marine
Chatonnet, Fabrice
Garnier, Xavier
Fest, Thierry
Siegel, Anne
Faron, Catherine
Dameron, Olivier
collection PubMed
description BACKGROUND: In life sciences, there has been a long-standing effort to standardize and integrate reference datasets and databases. Despite these efforts, many studies' data are provided in specific, non-standard formats. This hampers the capacity to reuse the studies' data in other pipelines, the capacity to reuse the pipelines' results in other studies, and the capacity to enrich the data with additional information. The Regulatory Circuits project is one of the largest efforts for integrating human cell genomics data to predict tissue-specific transcription factor-gene interaction networks. In spite of its success, it exhibits the usual shortcomings that limit its update, its reuse (as a whole or in part), and its extension with new data samples. To address these limitations, the resource has previously been integrated into an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, which limited the reuse of the dataset. In particular, it does not allow reusing only a portion of Regulatory Circuits when a study focuses on a subset of the tissues, nor combining the samples described in the datasets with samples from other studies. Overall, these limitations call for the design of a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies. RESULTS: We provide a modular RDF representation of Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of biological and experimental context mapped to the reference databases, (ii) annotations about TF-gene interactions at the sample level for 808 samples, (iii) annotations about TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs cited above. LERC is organised into 1,205 RDF named graphs representing the biological data, the sample-specific and tissue-specific networks, and the corresponding metadata. In total it contains 3,910,794,050 triples and is available as a SPARQL endpoint. CONCLUSION: The flexible and modular architecture of LERC supports biologically relevant SPARQL queries. It allows easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates their reuse in other studies. ASSOCIATED WEBSITE: https://regulatorycircuits-lod.genouest.org
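As an illustration of what the description means by a modular organisation into named graphs behind a SPARQL endpoint, the following is a minimal sketch of how a client might list the named graphs of such a resource. It uses only standard SPARQL 1.1 syntax and the Python standard library. The function names are hypothetical, and the endpoint URL in the usage note is a placeholder: this record gives the project website but not the exact endpoint path, so that path must be looked up rather than assumed.

```python
import json
import urllib.parse
import urllib.request


def named_graphs_query(limit=10):
    """Build a SPARQL 1.1 query that lists distinct named graphs."""
    return (
        "SELECT DISTINCT ?g\n"
        "WHERE { GRAPH ?g { ?s ?p ?o } }\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request (SPARQL 1.1 Protocol, 'query' parameter)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(endpoint, query):
    """Send the query and return the graph IRIs from the JSON results.

    Assumes the endpoint can return application/sparql-results+json.
    """
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return [binding["g"]["value"] for binding in data["results"]["bindings"]]
```

A call such as `run_query("https://example.org/sparql", named_graphs_query(5))` would return up to five graph IRIs once the placeholder is replaced by the real endpoint address. On a store of this size, restricting later queries to one named graph with `GRAPH <iri> { ... }` is the usual way to work with a single tissue or sample network.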
format Online
Article
Text
id pubmed-8962212
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8962212 2022-03-30 J Biomed Semantics Research BioMed Central 2022-03-28 /pmc/articles/PMC8962212/ /pubmed/35346379 http://dx.doi.org/10.1186/s13326-022-00266-4 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Improving reusability along the data life cycle: a regulatory circuits case study
topic Research