Repeat-aware modeling and correction of short read errors

BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome...

Bibliographic Details
Main authors:	Yang, Xiao; Aluru, Srinivas; Dorman, Karin S
Format:	Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044310/ https://www.ncbi.nlm.nih.gov/pubmed/21342585 http://dx.doi.org/10.1186/1471-2105-12-S1-S52

author	Yang, Xiao; Aluru, Srinivas; Dorman, Karin S
collection	PubMed
description	BACKGROUND: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those whose frequencies exceed a threshold. In genomes with high repeat content, an erroneous kmer may be observed frequently if it differs by only a few nucleotides from valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and the problem remains challenging for genomes with high repeat content. RESULTS: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold above which a kmer's estimated genomic frequency marks it as valid. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software License V1.0 at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.
CONCLUSIONS: We introduce a statistical framework for modeling sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors in genomes with high repeat content.
format	Text
id	pubmed-3044310
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3044310 2011-02-25 Repeat-aware modeling and correction of short read errors Yang, Xiao; Aluru, Srinivas; Dorman, Karin S BMC Bioinformatics Research BioMed Central 2011-02-15 /pmc/articles/PMC3044310/ /pubmed/21342585 http://dx.doi.org/10.1186/1471-2105-12-S1-S52 Text en Copyright ©2011 Yang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Repeat-aware modeling and correction of short read errors
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044310/ https://www.ncbi.nlm.nih.gov/pubmed/21342585 http://dx.doi.org/10.1186/1471-2105-12-S1-S52
