
Named Data Networking for Genomics Data Management and Integrated Workflows

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high resolution biological data. The community is rapidly heading toward the petascale in single investigator laboratory settings. As evidence, the single NCBI SRA...


Bibliographic Details
Main Authors:	Ogle, Cameron; Reddick, David; McKnight, Coleman; Biggs, Tyler; Pauly, Rini; Ficklin, Stephen P.; Feltus, F. Alex; Shannigrahi, Susmit
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Big Data
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7968724/ https://www.ncbi.nlm.nih.gov/pubmed/33748749 http://dx.doi.org/10.3389/fdata.2021.582468

_version_	1783666121334849536
author	Ogle, Cameron Reddick, David McKnight, Coleman Biggs, Tyler Pauly, Rini Ficklin, Stephen P. Feltus, F. Alex Shannigrahi, Susmit
author_facet	Ogle, Cameron Reddick, David McKnight, Coleman Biggs, Tyler Pauly, Rini Ficklin, Stephen P. Feltus, F. Alex Shannigrahi, Susmit
author_sort	Ogle, Cameron
collection	PubMed
description	Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA’s GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as federating content repositories, retrieving from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with the preliminary evaluation showing a sixfold speedup in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and to discuss the properties of NDN that can benefit that community. We do not present an extensive performance evaluation of NDN; we are extending and evaluating our pilot deployment and will present systematic results in future work.
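
The abstract describes two NDN ideas that are easier to see in a small model: data is requested by content name rather than by host address, and nodes along the path may cache popular data. The following is a minimal toy sketch of those two ideas only, not the authors' implementation and not the API of any NDN library. The class `Node`, its fields, and the name `/example/genomics/sample-1/reads` are hypothetical and chosen purely for illustration; the naming scheme agreed upon by the community is the one discussed in Section 4 of the paper, which this record does not reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Toy NDN-style node (hypothetical): answers by name or asks upstream."""
    label: str
    upstream: Optional["Node"] = None
    # Data this node publishes itself, keyed by content name.
    published: Dict[str, bytes] = field(default_factory=dict)
    # Content store: cached copies of data that passed through this node.
    content_store: Dict[str, bytes] = field(default_factory=dict)
    # How many requests this node had to forward upstream.
    upstream_requests: int = 0

    def express_interest(self, name: str) -> Optional[bytes]:
        """Return the data for `name`, using only the name to locate it."""
        if name in self.content_store:      # cache hit, no upstream traffic
            return self.content_store[name]
        if name in self.published:          # this node is the data source
            return self.published[name]
        if self.upstream is None:           # nowhere left to forward
            return None
        self.upstream_requests += 1
        data = self.upstream.express_interest(name)
        if data is not None:
            self.content_store[name] = data  # cache on the return path
        return data


# Hypothetical name and payload, for illustration only.
NAME = "/example/genomics/sample-1/reads"
repo = Node("repository", published={NAME: b"ACGT"})
router = Node("campus-router", upstream=repo)

first = router.express_interest(NAME)   # forwarded to the repository
second = router.express_interest(NAME)  # served from the router's cache
print(first, second, router.upstream_requests)
```

In this sketch the requester never names a server: it asks for the content name, and the second request is answered from the intermediate node's content store, which is the in-network caching effect the abstract refers to. Real NDN forwarding also involves signed Data packets, a pending-interest table, and longest-prefix matching, none of which are modeled here.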
format	Online Article Text
id	pubmed-7968724
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-7968724 2021-03-18 Named Data Networking for Genomics Data Management and Integrated Workflows Ogle, Cameron Reddick, David McKnight, Coleman Biggs, Tyler Pauly, Rini Ficklin, Stephen P. Feltus, F. Alex Shannigrahi, Susmit Front Big Data Big Data Frontiers Media S.A. 2021-02-15 /pmc/articles/PMC7968724/ /pubmed/33748749 http://dx.doi.org/10.3389/fdata.2021.582468 Text en Copyright © 2021 Ogle, Reddick, Mcknight, Biggs, Pauly, Ficklin, Feltus and Shannigrahi. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Big Data Ogle, Cameron Reddick, David McKnight, Coleman Biggs, Tyler Pauly, Rini Ficklin, Stephen P. Feltus, F. Alex Shannigrahi, Susmit Named Data Networking for Genomics Data Management and Integrated Workflows
title	Named Data Networking for Genomics Data Management and Integrated Workflows
title_full	Named Data Networking for Genomics Data Management and Integrated Workflows
title_fullStr	Named Data Networking for Genomics Data Management and Integrated Workflows
title_full_unstemmed	Named Data Networking for Genomics Data Management and Integrated Workflows
title_short	Named Data Networking for Genomics Data Management and Integrated Workflows
title_sort	named data networking for genomics data management and integrated workflows
topic	Big Data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7968724/ https://www.ncbi.nlm.nih.gov/pubmed/33748749 http://dx.doi.org/10.3389/fdata.2021.582468
work_keys_str_mv	AT oglecameron nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT reddickdavid nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT mcknightcoleman nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT biggstyler nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT paulyrini nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT ficklinstephenp nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT feltusfalex nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows AT shannigrahisusmit nameddatanetworkingforgenomicsdatamanagementandintegratedworkflows
