A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses

Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not facilitate the i...

Bibliographic Details
Main Authors:	Mrozek, Dariusz, Stępień, Krzysztof, Grzesik, Piotr, Małysiak-Mrozek, Bożena
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8314304/ https://www.ncbi.nlm.nih.gov/pubmed/34326863 http://dx.doi.org/10.3389/fgene.2021.699280

author	Mrozek, Dariusz Stępień, Krzysztof Grzesik, Piotr Małysiak-Mrozek, Bożena
collection	PubMed
description	Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not facilitate the improvement of the quality of NGS data for a large scale in a simple declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing associated with molecular profiling of diseases for personalized treatment require both good quality data and appropriate infrastructure for efficient storing and processing of the data. To solve the problems, we adapt the concept of Data Lake for storing and processing big NGS data. We also propose a dedicated library that allows cleaning the DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is largely scalable on the Cloud and may rapidly and flexibly adjust to the amount of data that should be processed. Moreover, to simplify the utilization of the data cleaning methods and implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language providing a set of capabilities for data extraction, processing, and storing. The results of our experiments prove that the whole solution supports requirements for ample storage and highly parallel, scalable processing that accompanies NGS-based multi-omics data analyses.
format	Online Article Text
id	pubmed-8314304
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8314304 2021-07-28 A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses Mrozek, Dariusz Stępień, Krzysztof Grzesik, Piotr Małysiak-Mrozek, Bożena Front Genet Genetics Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not facilitate the improvement of the quality of NGS data for a large scale in a simple declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing associated with molecular profiling of diseases for personalized treatment require both good quality data and appropriate infrastructure for efficient storing and processing of the data. To solve the problems, we adapt the concept of Data Lake for storing and processing big NGS data. We also propose a dedicated library that allows cleaning the DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is largely scalable on the Cloud and may rapidly and flexibly adjust to the amount of data that should be processed. Moreover, to simplify the utilization of the data cleaning methods and implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language providing a set of capabilities for data extraction, processing, and storing. The results of our experiments prove that the whole solution supports requirements for ample storage and highly parallel, scalable processing that accompanies NGS-based multi-omics data analyses. Frontiers Media S.A. 2021-07-13 /pmc/articles/PMC8314304/ /pubmed/34326863 http://dx.doi.org/10.3389/fgene.2021.699280 Text en Copyright © 2021 Mrozek, Stępień, Grzesik and Małysiak-Mrozek.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8314304/ https://www.ncbi.nlm.nih.gov/pubmed/34326863 http://dx.doi.org/10.3389/fgene.2021.699280