
Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

Bibliographic Details
Main Authors: Denton, James F., Lugo-Martinez, Jose, Tucker, Abraham E., Schrider, Daniel R., Warren, Wesley C., Hahn, Matthew W.
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256071/
https://www.ncbi.nlm.nih.gov/pubmed/25474019
http://dx.doi.org/10.1371/journal.pcbi.1003998
author Denton, James F.
Lugo-Martinez, Jose
Tucker, Abraham E.
Schrider, Daniel R.
Warren, Wesley C.
Hahn, Matthew W.
author_sort Denton, James F.
collection PubMed
description Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
format Online
Article
Text
id pubmed-4256071
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4256071 2014-12-11
PLoS Comput Biol
Research Article
Public Library of Science 2014-12-04
/pmc/articles/PMC4256071/ /pubmed/25474019 http://dx.doi.org/10.1371/journal.pcbi.1003998
Text en © 2014 Denton et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies
topic Research Article