A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

BACKGROUND: Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content...
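As a rough illustration of the idea summarized above, the sketch below computes the GC content of candidate training genes and groups them into GC classes, so that a gene predictor could be trained on each class separately. It is not the authors' protocol: the function names, the example sequences, and the 0.45 and 0.60 cut-offs are assumptions chosen only for illustration.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases are reported as 0.0.
    return gc / total if total else 0.0


def partition_by_gc(
    genes: Dict[str, str],
    low_max: float = 0.45,   # assumed cut-off, not taken from the article
    high_min: float = 0.60,  # assumed cut-off, not taken from the article
) -> Dict[str, List[str]]:
    """Group gene IDs into 'low', 'mid', and 'high' GC classes."""
    bins: Dict[str, List[str]] = {"low": [], "mid": [], "high": []}
    for gene_id, seq in genes.items():
        gc = gc_content(seq)
        if gc <= low_max:
            bins["low"].append(gene_id)
        elif gc >= high_min:
            bins["high"].append(gene_id)
        else:
            bins["mid"].append(gene_id)
    return bins


# Hypothetical coding sequences used only to exercise the functions.
example_genes = {
    "geneA": "ATGAAATTTTAA",
    "geneB": "ATGGCCGCGGCGTGA",
    "geneC": "ATGGCATCCGACTAG",
}

gc_bins = partition_by_gc(example_genes)
```

Each resulting list could then serve as a separate training set, in place of one randomly chosen set that mixes both GC classes.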

Bibliographic Details
Main authors: Bowman, Megan J., Pulman, Jane A., Liu, Tiffany L., Childs, Kevin L.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5702205/
https://www.ncbi.nlm.nih.gov/pubmed/29178822
http://dx.doi.org/10.1186/s12859-017-1942-z
_version_ 1783281480085012480
author Bowman, Megan J.
Pulman, Jane A.
Liu, Tiffany L.
Childs, Kevin L.
collection PubMed
description BACKGROUND: Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. When gene prediction programs are trained on a subset of grass genes with random GC content, they are effectively being trained on two classes of genes at once, and this can be expected to result in poor results when genes are predicted in new genome sequences. RESULTS: We find that gene prediction programs trained on grass genes with random GC content do not completely predict all grass genes with extreme GC content. We show that gene prediction programs that are trained with grass genes with high or low GC content can make both better and unique gene predictions compared to gene prediction programs that are trained on genes with random GC content. By separately training gene prediction programs with genes from multiple GC ranges and using the programs within the MAKER genome annotation pipeline, we were able to improve the annotation of the Oryza sativa genome compared to using the standard MAKER annotation protocol. Gene structure was improved in over 13% of genes, and 651 novel genes were predicted by the GC-specific MAKER protocol. CONCLUSIONS: We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa. We expect that this protocol will also be beneficial for gene prediction in any organism with bimodal or other unusual gene GC content. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1942-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5702205
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5702205 2017-12-04 BMC Bioinformatics Methodology Article BioMed Central 2017-11-25 /pmc/articles/PMC5702205/ /pubmed/29178822 http://dx.doi.org/10.1186/s12859-017-1942-z Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa
topic Methodology Article