SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing

BACKGROUND: Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on s...

Bibliographic Details
Main Authors: Chu, Hsueh-Ting, Hsiao, William WL., Tsao, Theresa TH., Hsu, D. Frank, Chen, Chaur-Chin, Lee, Sheng-An, Kao, Cheng-Yan
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3609794/
https://www.ncbi.nlm.nih.gov/pubmed/23544073
http://dx.doi.org/10.1371/journal.pone.0059484
_version_ 1782264365147226112
author Chu, Hsueh-Ting
Hsiao, William WL.
Tsao, Theresa TH.
Hsu, D. Frank
Chen, Chaur-Chin
Lee, Sheng-An
Kao, Cheng-Yan
author_facet Chu, Hsueh-Ting
Hsiao, William WL.
Tsao, Theresa TH.
Hsu, D. Frank
Chen, Chaur-Chin
Lee, Sheng-An
Kao, Cheng-Yan
author_sort Chu, Hsueh-Ting
collection PubMed
description BACKGROUND: Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on short-read genome assembly using idealized (error-free) reads at different lengths. METHODOLOGY/PRINCIPAL FINDINGS: We define a metric H^(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH^(k) to measure the impact of repeats for the reconstruction of whole-genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de-novo assembly coverage of a genome, and a score of ΔH^(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH^(k)<1% are different for various organisms and are independent of the genome size. For example, in order to meet the threshold of ΔH^(k)<1%, a read length of 60 bp is needed for the sequencing of human genome (3.2×10^9 bp) and 320 bp for the sequencing of fruit fly (1.8×10^8 bp). We also calculated the ΔH^(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and the entropy of sequencing reads provides a measurement for the repeat structures. CONCLUSIONS/SIGNIFICANCE: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation on the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful to tune the sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
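The abstract above defines the metric only in words, so the following is a minimal sketch of one plausible reading, not the authors' SeqEntropy implementation. It assumes that the idealized reads are all length-k substrings of a single strand, that the entropy is the Shannon entropy of their empirical distribution, and that the relative loss is measured against the maximum entropy log2(N) for N reads. The function names `read_entropy` and `relative_entropy_loss` are hypothetical.

```python
import math
from collections import Counter


def read_entropy(genome: str, k: int) -> float:
    """Shannon entropy (in bits) of all length-k substrings of `genome`.

    Assumption: every start position yields one error-free read of length k.
    """
    n = len(genome) - k + 1
    if k <= 0 or n <= 0:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(genome[i:i + k] for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_entropy_loss(genome: str, k: int) -> float:
    """Relative entropy loss (H_max - H) / H_max, with H_max = log2(N).

    Assumption: H_max is the entropy reached when all N reads are distinct,
    that is, when the sequence has no repeat of length k or more.
    """
    n = len(genome) - k + 1
    h = read_entropy(genome, k)  # also validates k
    h_max = math.log2(n)
    if h_max == 0.0:  # a single read carries no entropy to lose
        return 0.0
    return (h_max - h) / h_max


if __name__ == "__main__":
    toy = "ACGACGTT"  # contains the repeat "ACG"
    for k in (2, 4):
        print(k, relative_entropy_loss(toy, k))
```

On the toy sequence, the loss is positive at k=2 because the 2-mers AC and CG each occur twice, and it drops to zero at k=4, where every read is unique. This mirrors the abstract's point that a sufficiently long read length removes the entropy loss caused by repeats. A real genome would also need case normalization, handling of ambiguous bases, and a memory-efficient k-mer counter, none of which is attempted here.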
format Online
Article
Text
id pubmed-3609794
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-36097942013-03-29 SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing Chu, Hsueh-Ting Hsiao, William WL. Tsao, Theresa TH. Hsu, D. Frank Chen, Chaur-Chin Lee, Sheng-An Kao, Cheng-Yan PLoS One Research Article BACKGROUND: Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on short-read genome assembly using idealized (error-free) reads at different lengths. METHODOLOGY/PRINCIPAL FINDINGS: We define a metric H^(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH^(k) to measure the impact of repeats for the reconstruction of whole-genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de-novo assembly coverage of a genome, and a score of ΔH^(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH^(k)<1% are different for various organisms and are independent of the genome size. For example, in order to meet the threshold of ΔH^(k)<1%, a read length of 60 bp is needed for the sequencing of human genome (3.2×10^9 bp) and 320 bp for the sequencing of fruit fly (1.8×10^8 bp). We also calculated the ΔH^(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and the entropy of sequencing reads provides a measurement for the repeat structures. CONCLUSIONS/SIGNIFICANCE: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation on the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes.
This approach may be useful to tune the sequencing parameters to achieve better genome assemblies when a closely related genome is already available. Public Library of Science 2013-03-27 /pmc/articles/PMC3609794/ /pubmed/23544073 http://dx.doi.org/10.1371/journal.pone.0059484 Text en © 2013 Chu et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Chu, Hsueh-Ting
Hsiao, William WL.
Tsao, Theresa TH.
Hsu, D. Frank
Chen, Chaur-Chin
Lee, Sheng-An
Kao, Cheng-Yan
SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title_full SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title_fullStr SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title_full_unstemmed SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title_short SeqEntropy: Genome-Wide Assessment of Repeats for Short Read Sequencing
title_sort seqentropy: genome-wide assessment of repeats for short read sequencing
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3609794/
https://www.ncbi.nlm.nih.gov/pubmed/23544073
http://dx.doi.org/10.1371/journal.pone.0059484
work_keys_str_mv AT chuhsuehting seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT hsiaowilliamwl seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT tsaotheresath seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT hsudfrank seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT chenchaurchin seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT leeshengan seqentropygenomewideassessmentofrepeatsforshortreadsequencing
AT kaochengyan seqentropygenomewideassessmentofrepeatsforshortreadsequencing