
Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies

Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos t...


Bibliographic Details
Main authors: Florea, Liliana, Souvorov, Alexander, Kalbfleisch, Theodore S., Salzberg, Steven L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120881/
https://www.ncbi.nlm.nih.gov/pubmed/21731731
http://dx.doi.org/10.1371/journal.pone.0021400
author Florea, Liliana
Souvorov, Alexander
Kalbfleisch, Theodore S.
Salzberg, Steven L.
collection PubMed
description Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.
format Online
Article
Text
id pubmed-3120881
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3120881 2011-06-30 Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies Florea, Liliana Souvorov, Alexander Kalbfleisch, Theodore S. Salzberg, Steven L. PLoS One Research Article Public Library of Science 2011-06-22 /pmc/articles/PMC3120881/ /pubmed/21731731 http://dx.doi.org/10.1371/journal.pone.0021400 Text en This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. https://creativecommons.org/publicdomain/zero/1.0/
title Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies
topic Research Article