RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes

Summary: Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands...

Bibliographic Details
Main authors: Firtina, Can, Mansouri Ghiasi, Nika, Lindegger, Joel, Singh, Gagandeep, Cavlak, Meryem Banu, Mao, Haiyu, Mutlu, Onur
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Genome Sequence Analysis
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311405/
https://www.ncbi.nlm.nih.gov/pubmed/37387139
http://dx.doi.org/10.1093/bioinformatics/btad272
author Firtina, Can
Mansouri Ghiasi, Nika
Lindegger, Joel
Singh, Gagandeep
Cavlak, Meryem Banu
Mao, Haiyu
Mutlu, Onur
collection PubMed
description Summary: Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) [Formula: see text] and [Formula: see text] better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
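The quantize-then-hash idea in the description can be illustrated with a minimal sketch. This is not RawHash's implementation: the function names, the value range, the 4-bit bucket width, and the choice of three events per hash are all hypothetical, chosen only to show how slightly different signal values can collapse to the same hash key.

```python
from collections import defaultdict

def quantize(value, lo=-3.0, hi=3.0, bits=4):
    """Map a normalized signal value to one of 2**bits buckets.

    The range [lo, hi] and the bit width are illustrative assumptions.
    """
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * levels)
    return min(bucket, levels - 1)

def hash_events(events, bits=4):
    """Pack consecutive quantized values into a single integer key."""
    key = 0
    for v in events:
        key = (key << bits) | quantize(v, bits=bits)
    return key

def build_index(reference_events, k=3):
    """Index every window of k reference values by its hash key."""
    index = defaultdict(list)
    for i in range(len(reference_events) - k + 1):
        index[hash_events(reference_events[i:i + k])].append(i)
    return index

def query(index, read_events, k=3):
    """Return (read_offset, reference_offset) pairs with equal hash keys."""
    hits = []
    for j in range(len(read_events) - k + 1):
        for pos in index.get(hash_events(read_events[j:j + k]), []):
            hits.append((j, pos))
    return hits

# Hypothetical normalized signal values; the read is a noisy copy of the
# first three reference values.
reference = [0.10, -1.20, 0.85, 1.60, -0.40]
noisy_read = [0.12, -1.18, 0.88]
hits = query(build_index(reference), noisy_read)
```

In this toy example the noisy values fall into the same buckets as the reference values, so the lookup finds the matching window. Values that sit close to a bucket boundary can still land in different buckets; this sketch makes no attempt to handle that case.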
format Online
Article
Text
id pubmed-10311405
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10311405 2023-07-01 RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes. Bioinformatics, Genome Sequence Analysis. Oxford University Press 2023-06-30 /pmc/articles/PMC10311405/ /pubmed/37387139 http://dx.doi.org/10.1093/bioinformatics/btad272 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes
topic Genome Sequence Analysis