Repeat-aware modeling and correction of short read errors

BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome...

Bibliographic Details
Main Authors: Yang, Xiao, Aluru, Srinivas, Dorman, Karin S
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044310/
https://www.ncbi.nlm.nih.gov/pubmed/21342585
http://dx.doi.org/10.1186/1471-2105-12-S1-S52
_version_ 1782198717589225472
author Yang, Xiao
Aluru, Srinivas
Dorman, Karin S
author_facet Yang, Xiao
Aluru, Srinivas
Dorman, Karin S
author_sort Yang, Xiao
collection PubMed
description BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences with valid kmers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content. RESULTS: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”. 
CONCLUSIONS: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
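The BACKGROUND portion of the description mentions the conventional detection step: count how often each kmer is observed in the reads and validate those whose frequency exceeds a threshold. The following is a minimal sketch of that baseline only, not of the repeat-aware method the record describes. The function names, the toy reads, and the values k=4 and threshold=2 are illustrative assumptions and do not come from the record.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count observed occurrences of every length-k substring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def validate_kmers(counts, threshold):
    """Return the kmers whose observed frequency exceeds the threshold."""
    return {kmer for kmer, n in counts.items() if n > threshold}


# Hypothetical toy data: three identical reads and one read with a single
# substitution (A -> T at position 4).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
valid = validate_kmers(counts, threshold=2)
```

In this toy input the kmers that span the substitution (CGTT, GTTC) are observed once and fall below the threshold. As the description notes, in a genome with high repeat content an erroneous kmer that differs by few nucleotides from a frequently occurring valid kmer may itself be observed often enough to pass such a simple cutoff.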
format Text
id pubmed-3044310
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3044310 2011-02-25 Repeat-aware modeling and correction of short read errors Yang, Xiao Aluru, Srinivas Dorman, Karin S BMC Bioinformatics Research BioMed Central 2011-02-15 /pmc/articles/PMC3044310/ /pubmed/21342585 http://dx.doi.org/10.1186/1471-2105-12-S1-S52 Text en Copyright ©2011 Yang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Yang, Xiao
Aluru, Srinivas
Dorman, Karin S
Repeat-aware modeling and correction of short read errors
title Repeat-aware modeling and correction of short read errors
title_full Repeat-aware modeling and correction of short read errors
title_fullStr Repeat-aware modeling and correction of short read errors
title_full_unstemmed Repeat-aware modeling and correction of short read errors
title_short Repeat-aware modeling and correction of short read errors
title_sort repeat-aware modeling and correction of short read errors
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044310/
https://www.ncbi.nlm.nih.gov/pubmed/21342585
http://dx.doi.org/10.1186/1471-2105-12-S1-S52
work_keys_str_mv AT yangxiao repeatawaremodelingandcorrectionofshortreaderrors
AT alurusrinivas repeatawaremodelingandcorrectionofshortreaderrors
AT dormankarins repeatawaremodelingandcorrectionofshortreaderrors