Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates
Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or tw...
Main Author: | Press, William H |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2022 |
Subjects: | Biological, Health, and Medical Sciences |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/ https://www.ncbi.nlm.nih.gov/pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252 |
_version_ | 1784861671056998400 |
---|---|
author | Press, William H |
author_facet | Press, William H |
author_sort | Press, William H |
collection | PubMed |
description | Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today’s even commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads. |
format | Online Article Text |
id | pubmed-9802387 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9802387 2023-01-26 Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates Press, William H PNAS Nexus Biological, Health, and Medical Sciences Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today’s even commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads. Oxford University Press 2022-11-04 /pmc/articles/PMC9802387/ /pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Biological, Health, and Medical Sciences Press, William H Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title_full | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title_fullStr | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title_full_unstemmed | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title_short | Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates |
title_sort | fast trimer statistics facilitate accurate decoding of large random dna barcode sets even at large sequencing error rates |
topic | Biological, Health, and Medical Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/ https://www.ncbi.nlm.nih.gov/pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252 |
work_keys_str_mv | AT presswilliamh fasttrimerstatisticsfacilitateaccuratedecodingoflargerandomdnabarcodesetsevenatlargesequencingerrorrates |
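
The two-stage decoding idea summarized in the description (a cheap triage on trimer occurrence counts, followed by a Levenshtein comparison on the survivors) can be illustrated with a minimal CPU-only sketch. This is not the article's implementation: the record does not say which statistic is used to compare trimer counts, so the L1 distance below is an assumption, and the function names (`trimer_counts`, `count_distance`, `decode`) and the parameters `shortlist` and `max_edits` are hypothetical, chosen only for illustration.

```python
from itertools import product

# All 64 possible trimers over the DNA alphabet, mapped to a vector index.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIMERS)}


def trimer_counts(seq):
    """Count occurrences of each trimer in seq (one linear pass)."""
    counts = [0] * 64
    for i in range(len(seq) - 2):
        j = INDEX.get(seq[i:i + 3])
        if j is not None:  # skip windows containing non-ACGT symbols
            counts[j] += 1
    return counts


def count_distance(a, b):
    """L1 distance between two trimer count vectors (assumed triage score)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def levenshtein(s, t):
    """Classic dynamic-programming edit distance (substitutions and indels)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # match or substitution
        prev = cur
    return prev[-1]


def decode(read, barcodes, profiles, shortlist=50, max_edits=8):
    """Return the index of the best-matching barcode, or None if rejected.

    shortlist and max_edits are illustrative values, not taken from the article.
    """
    read_profile = trimer_counts(read)
    # Stage 1: keep only the barcodes whose trimer counts are closest to the read.
    candidates = sorted(range(len(barcodes)),
                        key=lambda k: count_distance(read_profile, profiles[k]))
    candidates = candidates[:shortlist]
    # Stage 2: run the expensive edit-distance comparison on the shortlist only.
    best = min(candidates, key=lambda k: levenshtein(read, barcodes[k]))
    if levenshtein(read, barcodes[best]) <= max_edits:
        return best
    return None


if __name__ == "__main__":
    barcodes = ["ACGTACGTAC", "TTTTGGGGCC", "GATTACAGAT"]  # toy barcodes
    profiles = [trimer_counts(b) for b in barcodes]        # precomputed once
    print(decode("ACGTACCTAC", barcodes, profiles, shortlist=2, max_edits=2))
```

Precomputing `profiles` once for the whole barcode set mirrors the description's point that the trimer statistics are computed in linear time for both reads and barcodes. The GPU parallelism the description mentions is not modeled here.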