TopFed: TCGA tailored federated query processing and linking to LOD

BACKGROUND: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis,...

Bibliographic Details
Main Authors:	Saleem, Muhammad, Padmanabhuni, Shanmukha S, Ngomo, Axel-Cyrille Ngonga, Iqbal, Aftab, Almeida, Jonas S, Decker, Stefan, Deus, Helena F
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4417511/ https://www.ncbi.nlm.nih.gov/pubmed/25937882 http://dx.doi.org/10.1186/2041-1480-5-47

_version_	1782369367846027264
author	Saleem, Muhammad Padmanabhuni, Shanmukha S Ngomo, Axel-Cyrille Ngonga Iqbal, Aftab Almeida, Jonas S Decker, Stefan Deus, Helena F
author_sort	Saleem, Muhammad
collection	PubMed
description	BACKGROUND: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis. METHODS: We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints through a TCGA tailored federated SPARQL query processing engine named TopFed. RESULTS: We compare TopFed with the well-established federation engine FedX in terms of source selection and query execution time, using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with a query execution time one third that of FedX. CONCLUSION: With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA, as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing.
format	Online Article Text
id	pubmed-4417511
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4417511 2015-05-04 TopFed: TCGA tailored federated query processing and linking to LOD Saleem, Muhammad Padmanabhuni, Shanmukha S Ngomo, Axel-Cyrille Ngonga Iqbal, Aftab Almeida, Jonas S Decker, Stefan Deus, Helena F J Biomed Semantics Research BACKGROUND: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such a large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis. METHODS: We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints through a TCGA tailored federated SPARQL query processing engine named TopFed. RESULTS: We compare TopFed with the well-established federation engine FedX in terms of source selection and query execution time, using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with a query execution time one third that of FedX. CONCLUSION: With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA, as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing. BioMed Central 2014-12-03 /pmc/articles/PMC4417511/ /pubmed/25937882 http://dx.doi.org/10.1186/2041-1480-5-47 Text en © Saleem et al.; licensee BioMed Central. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Saleem, Muhammad Padmanabhuni, Shanmukha S Ngomo, Axel-Cyrille Ngonga Iqbal, Aftab Almeida, Jonas S Decker, Stefan Deus, Helena F TopFed: TCGA tailored federated query processing and linking to LOD
title	TopFed: TCGA tailored federated query processing and linking to LOD
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4417511/ https://www.ncbi.nlm.nih.gov/pubmed/25937882 http://dx.doi.org/10.1186/2041-1480-5-47