
Sherlock: an open-source data platform to store, analyze and integrate Big Data for computational biologists

In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself. It requires downloading multiple public databases with various data structures, and in general, spending days preparing the data before an...


Bibliographic Details
Main Authors:	Bohar, Balazs; Fazekas, David; Madgwick, Matthew; Csabai, Luca; Olbei, Marton; Korcsmáros, Tamás; Szalay-Beko, Mate
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2023
Subjects:	Software Tool Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9731172/ https://www.ncbi.nlm.nih.gov/pubmed/36533093 http://dx.doi.org/10.12688/f1000research.52791.3

author	Bohar, Balazs; Fazekas, David; Madgwick, Matthew; Csabai, Luca; Olbei, Marton; Korcsmáros, Tamás; Szalay-Beko, Mate
collection	PubMed
description	In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself. It requires downloading multiple public databases with various data structures, and in general, spending days preparing the data before answering any biological questions. Here, we introduce Sherlock, an open-source, cloud-based big data platform (https://earlham-sherlock.github.io/) to solve this problem. Sherlock provides a gap-filling way for computational biologists to store, convert, query, share and generate biology data while ultimately streamlining bioinformatics data management. The Sherlock platform offers a simple interface to leverage big data technologies, such as Docker and PrestoDB. Sherlock is designed to enable users to analyze, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock can handle different structured data (interaction, localization, or genomic sequence) from several sources and convert them to a common optimized storage format, for example, the Optimized Row Columnar (ORC). This format facilitates Sherlock’s ability to quickly and efficiently execute distributed analytical queries on extremely large data files and share datasets between teams. The Sherlock platform is freely available on GitHub, and contains specific loader scripts for structured data sources of genomics, interaction and expression databases. With these loader scripts, users can easily and quickly create and work with specific file formats, such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform empowering data management, analytics, integration and collaboration through modern big data technologies.
format	Online Article Text
id	pubmed-9731172
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-9731172 2022-12-15 Sherlock: an open-source data platform to store, analyze and integrate Big Data for computational biologists Bohar, Balazs; Fazekas, David; Madgwick, Matthew; Csabai, Luca; Olbei, Marton; Korcsmáros, Tamás; Szalay-Beko, Mate F1000Res Software Tool Article F1000 Research Limited 2023-01-12 /pmc/articles/PMC9731172/ /pubmed/36533093 http://dx.doi.org/10.12688/f1000research.52791.3 Text en Copyright: © 2023 Bohar B et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Sherlock: an open-source data platform to store, analyze and integrate Big Data for computational biologists
topic	Software Tool Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9731172/ https://www.ncbi.nlm.nih.gov/pubmed/36533093 http://dx.doi.org/10.12688/f1000research.52791.3
