Indel-correcting DNA barcodes for high-throughput sequencing

Many large-scale, high-throughput experiments use DNA barcodes, short DNA sequences prepended to DNA libraries, for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to s...

Bibliographic Details
Main authors: Hawkins, John A., Jones, Stephen K., Finkelstein, Ilya J., Press, William H.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6142223/
https://www.ncbi.nlm.nih.gov/pubmed/29925596
http://dx.doi.org/10.1073/pnas.1802640115
author Hawkins, John A.
Jones, Stephen K.
Finkelstein, Ilya J.
Press, William H.
collection PubMed
description Many large-scale, high-throughput experiments use DNA barcodes, short DNA sequences prepended to DNA libraries, for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely used error-correcting codes borrowed from computer science (e.g., Hamming, Levenshtein codes) do not properly account for insertions and deletions (indels) in DNA barcodes, even though deletions are the most common type of synthesis error. Here, we present and experimentally validate filled/truncated right end edit (FREE) barcodes, which correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced guanine-cytosine (GC) content, minimal homopolymer runs, and reduced internal hairpin propensity. We generate and include lists of barcodes with different lengths and error correction levels that may be useful in diverse high-throughput applications, including >10^6 single-error–correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space combinatorially, generating lists with >10^15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
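The design constraints named in the description (balanced GC content, minimal homopolymer runs) can be illustrated with a minimal Python sketch. This is not the authors' software: the function names, the 40-60% GC window, and the maximum run length of 2 are hypothetical choices made only for illustration, and the sketch does not attempt the error-correcting (FREE) code construction or the hairpin check.

```python
# Illustrative sketch only; thresholds and names are assumptions,
# not taken from the FREE barcodes software.

def gc_fraction(seq: str) -> float:
    """Fraction of bases in a non-empty sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases in a non-empty sequence."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def passes_filters(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                   max_run: int = 2) -> bool:
    """Keep a candidate only if GC content and homopolymer runs are in range."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


candidates = [
    "ACGTACGTACGTACGT",  # balanced GC, no repeated bases
    "AAAAACGTACGTACGT",  # homopolymer run of five A's
    "GCGCGCGCGCGCGCGC",  # 100% GC
    "ATATATATATATATAT",  # 0% GC
]
kept = [seq for seq in candidates if passes_filters(seq)]
print(kept)
```

A filter of this kind would only prune the candidate pool; the error-correction guarantees described above depend on how the surviving sequences are selected relative to one another, which this sketch does not cover.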
format Online
Article
Text
id pubmed-6142223
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-6142223 2018-09-19 Proc Natl Acad Sci U S A PNAS Plus National Academy of Sciences 2018-07-03 2018-06-20 /pmc/articles/PMC6142223/ /pubmed/29925596 http://dx.doi.org/10.1073/pnas.1802640115 Text en Copyright © 2018 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Indel-correcting DNA barcodes for high-throughput sequencing
topic PNAS Plus