
Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing

Third-generation DNA sequencers provided by Oxford Nanopore Technologies (ONT) produce a series of samples of an electrical current in the nanopore. Such a time series is used to detect the sequence of nucleotides. The task of translation of current values into nucleotide symbols is called basecalli...


Bibliographic Details
Main authors: Napieralski, Adam; Nowak, Robert
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8954548/
https://www.ncbi.nlm.nih.gov/pubmed/35336445
http://dx.doi.org/10.3390/s22062275
_version_ 1784676120341250048
author Napieralski, Adam
Nowak, Robert
author_facet Napieralski, Adam
Nowak, Robert
author_sort Napieralski, Adam
collection PubMed
description Third-generation DNA sequencers provided by Oxford Nanopore Technologies (ONT) produce a series of samples of the electrical current in the nanopore. This time series is used to detect the sequence of nucleotides. The task of translating current values into nucleotide symbols is called basecalling. Various basecalling solutions have already been proposed. The earlier ones were based on Hidden Markov Models, but the best ones use neural networks or other machine learning models. Unfortunately, the accuracy achieved is still lower than that of competing sequencing techniques, such as Illumina’s. Basecallers differ in their input data type: currently, most of them work on raw data straight from the sequencer (a time series of current values). Still, the approach of using event data is also explored. Event data is obtained by preprocessing the raw data and dividing it into segments, each described by several features computed from the raw data values within that segment. We propose a novel basecaller that uses joint processing of raw and event data. We define basecalling as a sequence-to-sequence translation, and we use a machine learning model based on an encoder–decoder architecture of recurrent neural networks. Our model incorporates twin encoders and an attention mechanism. We tested our solution on simulated and real datasets, comparing the accuracy of the full model with that of its components, which process only raw or only event data. We also compared our solution with Guppy, the existing ONT basecaller. Results of numerical experiments show that joint raw and event data processing provides better basecalling accuracy than processing each data type separately. We implemented an application called Ravvent, freely available under the MIT licence.
format Online
Article
Text
id pubmed-8954548
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8954548 2022-03-26 Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing Napieralski, Adam Nowak, Robert Sensors (Basel) Article
MDPI 2022-03-15 /pmc/articles/PMC8954548/ /pubmed/35336445 http://dx.doi.org/10.3390/s22062275 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Napieralski, Adam
Nowak, Robert
Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title_full Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title_fullStr Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title_full_unstemmed Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title_short Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing
title_sort basecalling using joint raw and event nanopore data sequence-to-sequence processing
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8954548/
https://www.ncbi.nlm.nih.gov/pubmed/35336445
http://dx.doi.org/10.3390/s22062275
work_keys_str_mv AT napieralskiadam basecallingusingjointrawandeventnanoporedatasequencetosequenceprocessing
AT nowakrobert basecallingusingjointrawandeventnanoporedatasequencetosequenceprocessing