A machine learning approach for accurate and real-time DNA sequence identification

BACKGROUND: The all-electronic Single Molecule Break Junction (SMBJ) method is an emerging alternative to traditional polymerase chain reaction (PCR) techniques for genetic sequencing and identification. Existing work indicates that the current spectra recorded from SMBJ experimentations contain uni...

Bibliographic Details
Main authors: Wang, Yiren; Alangari, Mashari; Hihath, Joshua; Das, Arindam K.; Anantram, M. P.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects: Methodology Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8268518/
https://www.ncbi.nlm.nih.gov/pubmed/34243709
http://dx.doi.org/10.1186/s12864-021-07841-6
author Wang, Yiren
Alangari, Mashari
Hihath, Joshua
Das, Arindam K.
Anantram, M. P.
collection PubMed
description BACKGROUND: The all-electronic Single Molecule Break Junction (SMBJ) method is an emerging alternative to traditional polymerase chain reaction (PCR) techniques for genetic sequencing and identification. Existing work indicates that the current spectra recorded from SMBJ experimentations contain unique signatures to identify known sequences from a dataset. However, the spectra are typically extremely noisy due to the stochastic and complex interactions between the substrate, sample, environment, and the measuring system, necessitating hundreds or thousands of experimentations to obtain reliable and accurate results.
RESULTS: This article presents a DNA sequence identification system based on the current spectra of ten short strand sequences, including a pair that differs by a single mismatch. By employing a gradient boosted tree classifier model trained on conductance histograms, we demonstrate that extremely high accuracy, ranging from approximately 96 % for molecules differing by a single mismatch to 99.5 % otherwise, is possible. Further, such accuracy metrics are achievable in near real-time with just twenty or thirty SMBJ measurements instead of hundreds or thousands. We also demonstrate that a tandem classifier architecture, where the first stage is a multiclass classifier and the second stage is a binary classifier, can be employed to boost the single mismatched pair’s identification accuracy to 99.5 %.
CONCLUSIONS: A monolithic classifier, or more generally, a multistage classifier with model specific parameters that depend on experimental current spectra can be used to successfully identify DNA strands.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-021-07841-6.
format Online
Article
Text
id pubmed-8268518
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8268518 2021-07-09 A machine learning approach for accurate and real-time DNA sequence identification. Wang, Yiren; Alangari, Mashari; Hihath, Joshua; Das, Arindam K.; Anantram, M. P. BMC Genomics. Methodology Article.
BioMed Central 2021-07-09 /pmc/articles/PMC8268518/ /pubmed/34243709 http://dx.doi.org/10.1186/s12864-021-07841-6 Text en
© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A machine learning approach for accurate and real-time DNA sequence identification
topic Methodology Article