QuorUM: An Error Corrector for Illumina Reads


Bibliographic Details
Main Authors:	Marçais, Guillaume; Yorke, James A.; Zimin, Aleksey
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4471408/ https://www.ncbi.nlm.nih.gov/pubmed/26083032 http://dx.doi.org/10.1371/journal.pone.0130821

collection	PubMed
description	MOTIVATION: Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
RESULTS: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.
AVAILABILITY: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.
CONTACT: gmarcais@umd.edu.
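
The abstract's point about low-count erroneous k-mers can be illustrated with a small sketch. With a 1% error rate and 100× coverage, each genome position is read about 100 times, so roughly 100 × 0.01 = 1 of those base calls is expected to be wrong. The Python below is not QuorUM's implementation; it is a toy illustration, and the reads, the value of k, and the helper name `count_kmers` are all assumptions made for the example. It shows that a single substitution error introduces up to k new k-mers, each seen only once, while the true k-mers keep high counts.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy data (hypothetical): five reads covering the same 12-base region.
# One read carries a single substitution error (C -> T at index 6).
true_read = "ACGTTGCAAGCT"
bad_read = "ACGTTGTAAGCT"
reads = [true_read] * 4 + [bad_read]

k = 4  # illustrative value only; real tools use much larger k
counts = count_kmers(reads, k)

# In this toy setting the true sequence is known, so any k-mer that does
# not occur in it must have been created by the sequencing error.
true_kmers = set(count_kmers([true_read], k))
erroneous = {kmer: n for kmer, n in counts.items() if kmer not in true_kmers}

for kmer, n in sorted(counts.items(), key=lambda item: item[1]):
    label = "erroneous" if kmer in erroneous else "true"
    print(f"{kmer}\t{n}\t{label}")
```

In real data the true sequence is unknown, which is why count-based heuristics are used: k-mers that appear far less often than the coverage would predict are the likely products of errors.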
id	pubmed-4471408
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	PLoS One
published	2015-06-17
copyright	© 2015 Marçais et al
license	http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
