
HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhausti...

Bibliographic Details
Main Authors: Press, William H., Hawkins, John A., Jones, Stephen K., Schaub, Jeffrey M., Finkelstein, Ilya J.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414044/
https://www.ncbi.nlm.nih.gov/pubmed/32675237
http://dx.doi.org/10.1073/pnas.2004821117
author Press, William H.
Hawkins, John A.
Jones, Stephen K.
Schaub, Jeffrey M.
Finkelstein, Ilya J.
collection PubMed
description Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed–Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine–cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
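The description mentions user-defined sequence constraints such as avoiding excess repeats and keeping windowed GC content within bounds. As a rough illustration of what such constraints mean, the sketch below checks a strand against two of them. It is not the HEDGES encoder or decoder; the function names and the thresholds (`max_run=4`, `window=12`, GC fraction between 0.25 and 0.75) are hypothetical values chosen only for this example.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def windowed_gc_ok(seq: str, window: int = 12,
                   gc_lo: float = 0.25, gc_hi: float = 0.75) -> bool:
    """Check that every full window has a GC fraction in [gc_lo, gc_hi].

    Sequences shorter than the window contain no full window and pass.
    The window size and bounds are illustrative assumptions.
    """
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_fraction = sum(base in "GC" for base in chunk) / window
        if not gc_lo <= gc_fraction <= gc_hi:
            return False
    return True


def satisfies_constraints(seq: str, max_run: int = 4, window: int = 12,
                          gc_lo: float = 0.25, gc_hi: float = 0.75) -> bool:
    """Combine the repeat-length check and the windowed GC check."""
    return (max_homopolymer_run(seq) <= max_run
            and windowed_gc_ok(seq, window, gc_lo, gc_hi))


if __name__ == "__main__":
    for strand in ("ACGTACGTACGTACGT", "AAAAAACGTACG", "ATATATATATATAT"):
        print(strand, satisfies_constraints(strand))
```

A checker like this only tests a finished strand. According to the description, HEDGES incorporates such constraints into the code itself, which this sketch does not attempt.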
format Online
Article
Text
id pubmed-7414044
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-7414044 2020-08-21 Proc Natl Acad Sci U S A Biological Sciences National Academy of Sciences 2020-08-04 2020-07-16 /pmc/articles/PMC7414044/ /pubmed/32675237 http://dx.doi.org/10.1073/pnas.2004821117 Text en Copyright © 2020 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints
topic Biological Sciences