
Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or tw...


Bibliographic Details
Main Author:	Press, William H
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Biological, Health, and Medical Sciences
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/ https://www.ncbi.nlm.nih.gov/pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252

author	Press, William H
collection	PubMed
description	Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available even on today’s commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
format	Online Article Text
id	pubmed-9802387
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9802387 2023-01-26 Press, William H PNAS Nexus Biological, Health, and Medical Sciences Oxford University Press 2022-11-04 /pmc/articles/PMC9802387/ /pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates
topic	Biological, Health, and Medical Sciences
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802387/ https://www.ncbi.nlm.nih.gov/pubmed/36712375 http://dx.doi.org/10.1093/pnasnexus/pgac252
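
The two-stage decoding idea summarized in the description (a cheap trimer-count triage followed by an exact edit-distance check on the surviving candidates) can be illustrated with a minimal CPU-only sketch. This is not the author's implementation: the L1 distance between trimer-count vectors, the function names, and the parameters `n_candidates` and `max_edits` are all assumptions chosen for illustration, and the GPU parallelism described in the abstract is omitted entirely.

```python
from itertools import product

# Map each of the 64 possible trimers to an index (alphabetical order, "AAA" = 0).
TRIMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}


def trimer_counts(seq):
    """Count occurrences of each trimer in seq with a single linear pass."""
    counts = [0] * 64
    for i in range(len(seq) - 2):
        idx = TRIMERS.get(seq[i:i + 3])
        if idx is not None:  # skip trimers containing non-ACGT characters
            counts[idx] += 1
    return counts


def trimer_distance(a, b):
    """L1 distance between two trimer-count vectors (assumed triage score)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def levenshtein(s, t):
    """Classic dynamic-programming edit distance (substitutions and indels)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # match or substitution
        prev = cur
    return prev[-1]


def decode(read, barcodes, barcode_counts, n_candidates=8, max_edits=8):
    """Triage by trimer distance, then confirm the best candidates exactly.

    Returns (barcode_index, edit_distance), or (None, None) if no candidate
    is within max_edits. n_candidates and max_edits are hypothetical knobs.
    """
    rc = trimer_counts(read)
    ranked = sorted(range(len(barcodes)),
                    key=lambda k: trimer_distance(rc, barcode_counts[k]))
    best_idx, best_dist = None, max_edits + 1
    for k in ranked[:n_candidates]:
        d = levenshtein(read, barcodes[k])
        if d < best_dist:
            best_idx, best_dist = k, d
    if best_idx is None:
        return None, None
    return best_idx, best_dist


# Toy usage: three short barcodes and a read with one substitution.
barcodes = ["ACGTACGTAC", "TTTTTTTTTT", "GGGGCCCCGG"]
barcode_counts = [trimer_counts(b) for b in barcodes]  # precomputed once
result = decode("ACGTACGAAC", barcodes, barcode_counts, n_candidates=2)
```

In this sketch the barcode count vectors are precomputed once, so each read costs one linear pass plus cheap vector comparisons, and the quadratic-time edit distance runs only on the few barcodes that survive triage.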


Similar Items