Levenshtein error-correcting barcodes for multiplexed DNA sequencing

BACKGROUND: High-throughput sequencing technologies are improving in quality, capacity, and cost, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multi...

Bibliographic Details
Main Authors:	Buschmann, Tilo; Bystrykh, Leonid V
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853030/ https://www.ncbi.nlm.nih.gov/pubmed/24021088 http://dx.doi.org/10.1186/1471-2105-14-272

_version_	1782478768447684608
author	Buschmann, Tilo; Bystrykh, Leonid V
author_facet	Buschmann, Tilo; Bystrykh, Leonid V
author_sort	Buschmann, Tilo
collection	PubMed
description	BACKGROUND: High-throughput sequencing technologies are improving in quality, capacity, and cost, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag or barcode that is attached to the sequencing or amplification primer and hence appears at the beginning of the sequence in every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and Levenshtein codes. RESULTS: Levenshtein codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this paper we demonstrate the decreased error correction capability of Levenshtein codes in a DNA context and suggest an adaptation of Levenshtein codes that is proven to efficiently correct nucleotide errors in DNA sequences. In our adaptation we take the DNA context into account and redefine the word length whenever an insertion or deletion is revealed. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming-based codes in the presence of multiple errors.
CONCLUSION: We present an adaptation of Levenshtein codes to DNA contexts capable of correcting a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of recovering the new length of the corrupted codeword and of correcting on average more random mutations than traditional Levenshtein or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations, and barcode length for highest performance.
format	Online Article Text
id	pubmed-3853030
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3853030 2013-12-16 Levenshtein error-correcting barcodes for multiplexed DNA sequencing Buschmann, Tilo Bystrykh, Leonid V BMC Bioinformatics Methodology Article BioMed Central 2013-09-11 /pmc/articles/PMC3853030/ /pubmed/24021088 http://dx.doi.org/10.1186/1471-2105-14-272 Text en Copyright © 2013 Buschmann and Bystrykh; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Buschmann, Tilo Bystrykh, Leonid V Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title	Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title_full	Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title_fullStr	Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title_full_unstemmed	Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title_short	Levenshtein error-correcting barcodes for multiplexed DNA sequencing
title_sort	levenshtein error-correcting barcodes for multiplexed dna sequencing
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853030/ https://www.ncbi.nlm.nih.gov/pubmed/24021088 http://dx.doi.org/10.1186/1471-2105-14-272
work_keys_str_mv	AT buschmanntilo levenshteinerrorcorrectingbarcodesformultiplexeddnasequencing AT bystrykhleonidv levenshteinerrorcorrectingbarcodesformultiplexeddnasequencing
