The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies
BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is...
Main Authors: | Indrischek, Henrike; Wieseke, Nicolas; Stadler, Peter F.; Prohaska, Sonja J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2016 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765045/ https://www.ncbi.nlm.nih.gov/pubmed/26913054 http://dx.doi.org/10.1186/s13015-016-0063-y |
_version_ | 1782417489614864384 |
---|---|
author | Indrischek, Henrike Wieseke, Nicolas Stadler, Peter F. Prohaska, Sonja J. |
author_facet | Indrischek, Henrike Wieseke, Nicolas Stadler, Peter F. Prohaska, Sonja J. |
author_sort | Indrischek, Henrike |
collection | PubMed |
description | BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. RESULTS: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. CONCLUSIONS: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0063-y) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4765045 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-47650452016-02-25 The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies Indrischek, Henrike Wieseke, Nicolas Stadler, Peter F. Prohaska, Sonja J. Algorithms Mol Biol Research BACKGROUND: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. RESULTS: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. CONCLUSIONS: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0063-y) contains supplementary material, which is available to authorized users. BioMed Central 2016-02-24 /pmc/articles/PMC4765045/ /pubmed/26913054 http://dx.doi.org/10.1186/s13015-016-0063-y Text en © Indrischek et al. 
2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Indrischek, Henrike Wieseke, Nicolas Stadler, Peter F. Prohaska, Sonja J. The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title | The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title_full | The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title_fullStr | The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title_full_unstemmed | The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title_short | The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
title_sort | paralog-to-contig assignment problem: high quality gene models from fragmented assemblies |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4765045/ https://www.ncbi.nlm.nih.gov/pubmed/26913054 http://dx.doi.org/10.1186/s13015-016-0063-y |