
Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or tw...


Bibliographic Details
Main author: Press, William H
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/
https://www.ncbi.nlm.nih.gov/pubmed/36712375
http://dx.doi.org/10.1093/pnasnexus/pgac252
author Press, William H
collection PubMed
description Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at 10% or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today’s even commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
format Online
Article
Text
id pubmed-9802387
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9802387 2023-01-26 Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates Press, William H PNAS Nexus Biological, Health, and Medical Sciences Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at 10% or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today’s even commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads. Oxford University Press 2022-11-04 /pmc/articles/PMC9802387/ /pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252 Text en © The Author(s) 2022.
Published by Oxford University Press on behalf of the National Academy of Sciences. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates
topic Biological, Health, and Medical Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/
https://www.ncbi.nlm.nih.gov/pubmed/36712375
http://dx.doi.org/10.1093/pnasnexus/pgac252
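
The two-stage decoding idea summarized in the description field (a cheap trimer-count triage followed by a slow edit-distance comparison on the survivors) can be illustrated with a minimal CPU-only sketch. This is not the author's implementation: the L1 distance between trimer count vectors, the `shortlist` size, the `max_dist` acceptance threshold, and the helper names `trimer_counts`, `levenshtein`, `mutate`, and `decode` are all assumptions chosen for illustration, and the GPU parallelism mentioned in the record is not reproduced here.

```python
import random
from itertools import product

# Map each of the 64 possible trimers over {A, C, G, T} to a vector index.
TRIMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}


def trimer_counts(seq):
    """Count occurrences of every trimer in seq; linear in len(seq).

    Assumes seq contains only the letters A, C, G, T.
    """
    counts = [0] * 64
    for i in range(len(seq) - 2):
        counts[TRIMERS[seq[i:i + 3]]] += 1
    return counts


def l1(a, b):
    """L1 distance between two count vectors (assumed triage metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def levenshtein(s, t):
    """Classic dynamic-programming edit distance (substitutions and indels)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # match or substitution
        prev = cur
    return prev[-1]


def decode(read, barcodes, barcode_counts, shortlist=20, max_dist=8):
    """Return the index of the best-matching barcode, or None if rejected.

    Stage 1 ranks all barcodes by trimer-count distance and keeps a shortlist.
    Stage 2 runs the slow edit-distance comparison only on that shortlist.
    """
    rc = trimer_counts(read)
    order = sorted(range(len(barcodes)), key=lambda k: l1(rc, barcode_counts[k]))
    scored = [(levenshtein(read, barcodes[k]), k) for k in order[:shortlist]]
    dist, best = min(scored)
    return best if dist <= max_dist else None


def mutate(seq, rate, rng):
    """Apply random substitutions, insertions, and deletions at the given rate."""
    out = []
    for ch in seq:
        if rng.random() < rate:
            kind = rng.choice(("sub", "ins", "del"))
            if kind == "sub":
                out.append(rng.choice([c for c in "ACGT" if c != ch]))
            elif kind == "ins":
                out.extend((ch, rng.choice("ACGT")))
            # "del": emit nothing for this position
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    barcodes = ["".join(rng.choice("ACGT") for _ in range(34)) for _ in range(1000)]
    counts = [trimer_counts(b) for b in barcodes]   # precomputed once per barcode set
    read = mutate(barcodes[123], 0.10, rng)         # simulated noisy read
    print(decode(read, barcodes, counts))
```

The point of the design is that the per-barcode work in stage 1 is a fixed 64-element comparison regardless of alignment, so the quadratic-time edit distance runs on only `shortlist` candidates instead of the whole set. The rejection threshold trades recall for precision in the sense used in the description: a lower `max_dist` rejects more reads but makes accepted decodes more reliable.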