HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints
Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhausti...
Main Authors: | Press, William H.; Hawkins, John A.; Jones, Stephen K.; Schaub, Jeffrey M.; Finkelstein, Ilya J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2020 |
Subjects: | Biological Sciences |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414044/ https://www.ncbi.nlm.nih.gov/pubmed/32675237 http://dx.doi.org/10.1073/pnas.2004821117 |
_version_ | 1783568904045461504 |
---|---|
author | Press, William H.; Hawkins, John A.; Jones, Stephen K.; Schaub, Jeffrey M.; Finkelstein, Ilya J. |
collection | PubMed |
description | Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed–Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine–cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding. |
format | Online Article Text |
id | pubmed-7414044 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-7414044 2020-08-21 HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Press, William H.; Hawkins, John A.; Jones, Stephen K.; Schaub, Jeffrey M.; Finkelstein, Ilya J. Proc Natl Acad Sci U S A. Biological Sciences. National Academy of Sciences 2020-08-04 2020-07-16 /pmc/articles/PMC7414044/ /pubmed/32675237 http://dx.doi.org/10.1073/pnas.2004821117 Text en Copyright © 2020 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints |
topic | Biological Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414044/ https://www.ncbi.nlm.nih.gov/pubmed/32675237 http://dx.doi.org/10.1073/pnas.2004821117 |
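
The description above mentions user-defined sequence constraints such as avoiding excess repeats and keeping windowed GC content from running too high or too low. The following is a minimal Python sketch of what checking such constraints on a candidate strand could look like. It is not the HEDGES implementation: the function names and the default thresholds (`max_run`, `window`, `gc_min`, `gc_max`) are hypothetical values chosen only for illustration, and the record does not state which limits the authors used.

```python
from typing import Iterator


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base in seq."""
    longest = 0
    run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def windowed_gc(seq: str, window: int) -> Iterator[float]:
    """Yield the GC fraction of every full window of length `window`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(seq) < window:
        return  # no full window exists, so nothing is yielded
    gc = sum(base in "GC" for base in seq[:window])
    yield gc / window
    # Slide the window one base at a time, updating the count incrementally.
    for i in range(window, len(seq)):
        gc += (seq[i] in "GC") - (seq[i - window] in "GC")
        yield gc / window


def satisfies_constraints(
    seq: str,
    max_run: int = 4,      # hypothetical limit on repeated bases
    window: int = 12,      # hypothetical GC window length
    gc_min: float = 0.35,  # hypothetical lower GC bound
    gc_max: float = 0.65,  # hypothetical upper GC bound
) -> bool:
    """Check a strand against the illustrative repeat and GC limits."""
    if max_homopolymer_run(seq) > max_run:
        return False
    return all(gc_min <= frac <= gc_max for frac in windowed_gc(seq, window))
```

With these illustrative defaults, a strand such as `"ACGTACGTACGTACGT"` passes both checks, while one containing five consecutive `A` bases or no G/C bases at all in a 12-base window is rejected. A strand shorter than the window is only subject to the repeat check in this sketch.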