Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment

BACKGROUND: Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due...

Bibliographic Details
Main authors: Gotoh, Osamu; Morita, Mariko; Nelson, David R
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4065584/
https://www.ncbi.nlm.nih.gov/pubmed/24927652
http://dx.doi.org/10.1186/1471-2105-15-189
author Gotoh, Osamu
Morita, Mariko
Nelson, David R
collection PubMed
description BACKGROUND: Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. RESULTS: We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50 to 60% of the annotated gene structures are likely to contain some defects. More than half of the defect-containing genes may be intrinsically broken, i.e., they are pseudogenes or gene fragments, are located in unfinished sequencing areas, or correspond to non-productive isoforms; the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. CONCLUSIONS: Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
format Online
Article
Text
id pubmed-4065584
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4065584 2014-06-22 Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. Gotoh, Osamu; Morita, Mariko; Nelson, David R. BMC Bioinformatics. Research Article. BioMed Central 2014-06-14 /pmc/articles/PMC4065584/ /pubmed/24927652 http://dx.doi.org/10.1186/1471-2105-15-189 Text en Copyright © 2014 Gotoh et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4065584/
https://www.ncbi.nlm.nih.gov/pubmed/24927652
http://dx.doi.org/10.1186/1471-2105-15-189