High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

Genomic data is now one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for an adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutat...

Bibliographic Details
Main Authors: Oujja, Anas; Abid, Mohamed Riduan; Boumhidi, Jaouad; Bourhnane, Safae; Mourhir, Asmaa; Merchant, Fatima; Benhaddou, Driss
Format: Online Article Text
Language: English
Published: Korea Genome Organization 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8752974/
https://www.ncbi.nlm.nih.gov/pubmed/35012291
http://dx.doi.org/10.5808/gi.21056
_version_ 1784631991202742272
author Oujja, Anas
Abid, Mohamed Riduan
Boumhidi, Jaouad
Bourhnane, Safae
Mourhir, Asmaa
Merchant, Fatima
Benhaddou, Driss
collection PubMed
description Genomic data is now one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which calls for an adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community urgently needs ICT tools to process SARS-CoV-2 RNA data, e.g., by clustering it, and thus to assist in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
format Online
Article
Text
id pubmed-8752974
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Korea Genome Organization
record_format MEDLINE/PubMed
spelling pubmed-8752974 2022-01-24 High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach Oujja, Anas Abid, Mohamed Riduan Boumhidi, Jaouad Bourhnane, Safae Mourhir, Asmaa Merchant, Fatima Benhaddou, Driss Genomics Inform Application Note
Korea Genome Organization 2021-12-31 /pmc/articles/PMC8752974/ /pubmed/35012291 http://dx.doi.org/10.5808/gi.21056 Text en (c) 2021, Korea Genome Organization https://creativecommons.org/licenses/by/4.0/ (CC) This is an open-access article distributed under the terms of the Creative Commons Attribution license (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach
topic Application Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8752974/
https://www.ncbi.nlm.nih.gov/pubmed/35012291
http://dx.doi.org/10.5808/gi.21056
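
The description above says that the LCS length is computed for each pair of RNA sequences and then used to measure dissimilarity. The following is a minimal single-machine sketch of that idea in Python, not the authors' Hadoop MapReduce implementation. The normalization used in `dissimilarity` (one minus the LCS length divided by the longer sequence length) is an assumption, since the record does not state the exact formula, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b.

    Classic dynamic programming, keeping only two rows of the table.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Assumed measure: 1 - LCS length / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(
    sequences: Dict[str, str],
) -> Dict[Tuple[str, str], float]:
    """Compute the dissimilarity for every unordered pair of sequences.

    In a MapReduce setting each pair would be an independent unit of
    work; here the pairs are simply processed in a loop.
    """
    return {
        (i, j): dissimilarity(sequences[i], sequences[j])
        for i, j in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy sequences, far shorter than real SARS-CoV-2 genomes.
    toy = {"s1": "ACGU", "s2": "ACGA", "s3": "UUUU"}
    for pair, value in pairwise_dissimilarities(toy).items():
        print(pair, value)
```

The resulting pairwise values could be passed to any clustering method that accepts a precomputed dissimilarity matrix. Because each pair is independent, the computation parallelizes naturally across worker nodes.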