
Field of genes: using Apache Kafka as a bioinformatic data repository


Bibliographic Details
Main authors:	Lawlor, Brendan; Lynch, Richard; Mac Aogáin, Micheál; Walsh, Paul
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5906921/ https://www.ncbi.nlm.nih.gov/pubmed/29635394 http://dx.doi.org/10.1093/gigascience/giy036

Abstract:
BACKGROUND: Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI’s) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI’s RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files.
RESULTS: The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files.
CONCLUSIONS: Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data.
Record ID:	pubmed-5906921
Institution:	National Center for Biotechnology Information
Record format:	MEDLINE/PubMed
Journal:	GigaScience
Publication date:	2018-04-09
Record date:	2018-04-24
Copyright:	© The Author(s) 2018. Published by Oxford University Press.
License:	This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
