
Single Nucleotide Polymorphisms Caused by Assembly Errors

We compare the results of three different assembler programs, Celera, Phrap and Mira2, for the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons we do not focus on speed of computation and numbers of assembled c...


Bibliographic Details
Main authors:	Kleffe, Jürgen; Weißmann, Robert; Schmitzberger, Florian F
Format:	Online Article Text
Language:	English
Published:	Libertas Academica 2010
Subjects:	Original Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4510600/ https://www.ncbi.nlm.nih.gov/pubmed/26279623 http://dx.doi.org/10.4137/GEI.S3653

collection	PubMed
description	We compare the results of three different assembler programs, Celera, Phrap and Mira2, for the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons, we do not focus on speed of computation and numbers of assembled contigs but on how well the different sequence assemblies agree in content. Genome regions assembled consistently by all three programs are identified in order to estimate a lower bound on the number of erroneously identified single nucleotide polymorphisms (SNPs) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de novo assemblies, spanning only 34% (3.3 Mb) of the bacterial genome, with 175 of these regions (~1.5 Mb) including erroneous SNPs and insertions/deletions. Within these triplets this leads, on average, to one error per 7,155 base pairs. When the assembler Mira2 is replaced with the most recent version, Mira3, the latter number drops to 5,923. Our results therefore suggest that a considerable number of erroneous SNPs may be present in current sequence data and that mathematicians should urgently take up research on the numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are supplied as input. Such errors will severely hamper molecular diagnostics as well as efforts to relate genome variation to disease. This issue needs to be addressed urgently as the field is moving fast into clinical applications.
format	Online Article Text
id	pubmed-4510600
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4510600 2015-08-14. Genomics Insights. Original Research. Libertas Academica 2010-02-04. /pmc/articles/PMC4510600/ /pubmed/26279623 http://dx.doi.org/10.4137/GEI.S3653 Text en. © 2010 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
