The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies

BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is...

Bibliographic Details
Main authors: Indrischek, Henrike, Wieseke, Nicolas, Stadler, Peter F., Prohaska, Sonja J.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765045/
https://www.ncbi.nlm.nih.gov/pubmed/26913054
http://dx.doi.org/10.1186/s13015-016-0063-y
author Indrischek, Henrike
Wieseke, Nicolas
Stadler, Peter F.
Prohaska, Sonja J.
collection PubMed
description BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. RESULTS: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. CONCLUSIONS: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0063-y) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4765045
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4765045 2016-02-25 Algorithms Mol Biol Research BioMed Central 2016-02-24 /pmc/articles/PMC4765045/ /pubmed/26913054 http://dx.doi.org/10.1186/s13015-016-0063-y Text en © Indrischek et al. 2016. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765045/
https://www.ncbi.nlm.nih.gov/pubmed/26913054
http://dx.doi.org/10.1186/s13015-016-0063-y
work_keys_str_mv AT indrischekhenrike theparalogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT wiesekenicolas theparalogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT stadlerpeterf theparalogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT prohaskasonjaj theparalogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT indrischekhenrike paralogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT wiesekenicolas paralogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT stadlerpeterf paralogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
AT prohaskasonjaj paralogtocontigassignmentproblemhighqualitygenemodelsfromfragmentedassemblies
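The abstract in this record states that the paralog-to-contig assignment problem is phrased as an Integer Linear Programming problem. The sketch below is only a toy illustration of that kind of assignment problem, not the ExonMatchSolver formulation. It makes several assumptions that do not come from the record: each contig carries a hypothetical score per paralog, each contig goes to at most one paralog, and no two contigs assigned to the same paralog may cover the same exon. For simplicity it uses exhaustive search instead of an ILP solver, which is feasible only for tiny inputs. All names (`best_assignment`, `scores`, `exons`) are made up for the example.

```python
from itertools import product


def best_assignment(scores, exons, n_paralogs):
    """Toy exhaustive search for a contig-to-paralog assignment.

    scores[c][p]: hypothetical score of contig c for paralog p (assumption).
    exons[c]:     set of exon indices with hits on contig c (assumption).
    Returns (best total score, {contig: paralog index or None}).
    """
    contigs = sorted(scores)
    # None means the contig is left unassigned.
    options = [None] + list(range(n_paralogs))
    best_total, best = float("-inf"), None

    for choice in product(options, repeat=len(contigs)):
        used = {p: set() for p in range(n_paralogs)}
        total, feasible = 0.0, True
        for c, p in zip(contigs, choice):
            if p is None:
                continue
            # Assumed constraint: an exon is covered at most once per paralog.
            if used[p] & exons[c]:
                feasible = False
                break
            used[p] |= exons[c]
            total += scores[c][p]
        if feasible and total > best_total:
            best_total, best = total, dict(zip(contigs, choice))
    return best_total, best


# Made-up example: contigs c1 and c2 both carry exons 1 and 2,
# so they cannot belong to the same paralog; c3 carries exon 3.
scores = {"c1": [50, 10], "c2": [45, 40], "c3": [5, 30]}
exons = {"c1": {1, 2}, "c2": {1, 2}, "c3": {3}}
total, assignment = best_assignment(scores, exons, n_paralogs=2)
print(total, assignment)
```

In this made-up input, the overlapping exons force c1 and c2 onto different paralogs, and c3 joins whichever paralog scores higher without a conflict. The search space grows as (paralogs + 1) to the power of the number of contigs, which is why a realistic instance would need an ILP solver or a similar method.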