BioFed: federated query processing over life sciences linked open data

BACKGROUND: Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but this still requires solutions to query multiple endpoi...

Bibliographic Details
Main Authors:	Hasnain, Ali; Mehmood, Qaiser; Sana e Zainab, Syeda; Saleem, Muhammad; Warren, Claude; Zehra, Durre; Decker, Stefan; Rebholz-Schuhmann, Dietrich
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5353896/ https://www.ncbi.nlm.nih.gov/pubmed/28298238 http://dx.doi.org/10.1186/s13326-017-0118-0

_version_	1782515225778454528
author	Hasnain, Ali; Mehmood, Qaiser; Sana e Zainab, Syeda; Saleem, Muhammad; Warren, Claude; Zehra, Durre; Decker, Stefan; Rebholz-Schuhmann, Dietrich
author_facet	Hasnain, Ali; Mehmood, Qaiser; Sana e Zainab, Syeda; Saleem, Muhammad; Warren, Claude; Zehra, Durre; Decker, Stefan; Rebholz-Schuhmann, Dietrich
author_sort	Hasnain, Ali
collection	PubMed
description	BACKGROUND: Biomedical data, e.g., from knowledge bases and ontologies, is increasingly made available following open linked data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but it still requires solutions to query multiple endpoints for their heterogeneous data to eventually retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Last but not least, over time, the reliability of data resources in terms of access and quality has to be monitored. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain.
METHODS: The efficient cataloguing approach of the federated query processing system 'BioFed', the triple-pattern-wise source selection and the semantic source normalisation form the core of our solution. It gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. Last but not least, BioFed makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits for query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, Sider).
RESULTS: BioFed is a single-point-of-access solution for a large number of SPARQL endpoints providing life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. BioFed fully supports SPARQL 1.1 and gives access to the endpoint’s availability based on the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance in comparison to FedX, which can be attributed to the provision of provenance information for the source selection.
CONCLUSION: Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could even be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
format	Online Article Text
id	pubmed-5353896
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5353896 2017-03-22 BioFed: federated query processing over life sciences linked open data Hasnain, Ali Mehmood, Qaiser Sana e Zainab, Syeda Saleem, Muhammad Warren, Claude Zehra, Durre Decker, Stefan Rebholz-Schuhmann, Dietrich J Biomed Semantics Research BioMed Central 2017-03-15 /pmc/articles/PMC5353896/ /pubmed/28298238 http://dx.doi.org/10.1186/s13326-017-0118-0 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Hasnain, Ali Mehmood, Qaiser Sana e Zainab, Syeda Saleem, Muhammad Warren, Claude Zehra, Durre Decker, Stefan Rebholz-Schuhmann, Dietrich BioFed: federated query processing over life sciences linked open data
title	BioFed: federated query processing over life sciences linked open data
title_full	BioFed: federated query processing over life sciences linked open data
title_fullStr	BioFed: federated query processing over life sciences linked open data
title_full_unstemmed	BioFed: federated query processing over life sciences linked open data
title_short	BioFed: federated query processing over life sciences linked open data
title_sort	biofed: federated query processing over life sciences linked open data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5353896/ https://www.ncbi.nlm.nih.gov/pubmed/28298238 http://dx.doi.org/10.1186/s13326-017-0118-0
work_keys_str_mv	AT hasnainali biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT mehmoodqaiser biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT sanaezainabsyeda biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT saleemmuhammad biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT warrenclaude biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT zehradurre biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT deckerstefan biofedfederatedqueryprocessingoverlifescienceslinkedopendata AT rebholzschuhmanndietrich biofedfederatedqueryprocessingoverlifescienceslinkedopendata
