
De novo likelihood-based measures for comparing genome assemblies

BACKGROUND: The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of doze...


Bibliographic Details
Main Authors:	Ghodsi, Mohammadreza; Hill, Christopher M; Astrovskaya, Irina; Lin, Henry; Sommer, Dan D; Koren, Sergey; Pop, Mihai
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3765854/ https://www.ncbi.nlm.nih.gov/pubmed/23965294 http://dx.doi.org/10.1186/1756-0500-6-334
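
The full abstract of this record defines the quality of an assembly as the conditional probability of observing the sequenced reads given the assembled sequence, and notes that the score can be estimated from a relatively small sample of the reads. The Python sketch below illustrates that idea only. Everything specific in it is an assumption made for illustration, not the authors' model: reads are taken to start at a uniformly random position on the forward strand, errors are independent substitutions at a fixed `error_rate`, and the function names are hypothetical.

```python
import math
import random


def read_probability(read, assembly, error_rate=0.01):
    """P(read | assembly) under a toy model (an assumption, not the paper's model):
    uniform start position, forward strand only, independent substitution errors."""
    n_starts = len(assembly) - len(read) + 1
    if n_starts <= 0:
        return 0.0  # the read cannot be placed on a shorter assembly
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for a, r in zip(assembly[start:start + len(read)], read):
            # A matching base is read correctly; a mismatch is one of 3 possible errors.
            p *= (1.0 - error_rate) if a == r else error_rate / 3.0
        total += p
    return total / n_starts


def log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of log P(read | assembly) over all reads; higher is better."""
    score = 0.0
    for read in reads:
        p = read_probability(read, assembly, error_rate)
        score += math.log(p) if p > 0.0 else float("-inf")
    return score


def sampled_log_likelihood(reads, assembly, sample_size, error_rate=0.01, seed=0):
    """Estimate the full score from a random sample of reads by scaling the
    sample mean up to the total number of reads."""
    rng = random.Random(seed)
    sample = rng.sample(reads, min(sample_size, len(reads)))
    mean = log_likelihood(sample, assembly, error_rate) / len(sample)
    return mean * len(reads)


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCT"                      # hypothetical true sequence
    candidate = "ACGTTGCTAGGCT"                   # same sequence with one wrong base
    reads = [genome[i:i + 5] for i in range(len(genome) - 4)]  # error-free 5-mers
    print(log_likelihood(reads, genome))
    print(log_likelihood(reads, candidate))
    print(sampled_log_likelihood(reads, genome, sample_size=4))
```

In this toy setting the sequence the reads were drawn from scores higher than the candidate containing a wrong base, which mirrors the property stated in the abstract that the true genome maximizes the score. A realistic implementation would need an alignment-based placement of reads, both strands, and a proper error model.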

author	Ghodsi, Mohammadreza; Hill, Christopher M; Astrovskaya, Irina; Lin, Henry; Sommer, Dan D; Koren, Sergey; Pop, Mihai
collection	PubMed
description	BACKGROUND: The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. RESULTS: We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). 
Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. CONCLUSION: Likelihood-based measures, such as ours proposed here, will become the new standard for de novo assembly evaluation.
format	Online Article Text
id	pubmed-3765854
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3765854 2013-09-12 De novo likelihood-based measures for comparing genome assemblies. BMC Res Notes. Research Article. BioMed Central 2013-08-22 /pmc/articles/PMC3765854/ /pubmed/23965294 http://dx.doi.org/10.1186/1756-0500-6-334 Text en Copyright © 2013 Ghodsi et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	De novo likelihood-based measures for comparing genome assemblies
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3765854/ https://www.ncbi.nlm.nih.gov/pubmed/23965294 http://dx.doi.org/10.1186/1756-0500-6-334
