
Foster thy young: enhanced prediction of orphan genes in assembled genomes

Proteins encoded by newly-emerged genes (‘orphan genes’) share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene prediction...


Bibliographic Details
Main Authors: Li, Jing, Singh, Urminder, Bhandary, Priyanka, Campbell, Jacqueline, Arendsee, Zebulun, Seetharam, Arun S, Wurtele, Eve Syrkin
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9023268/
https://www.ncbi.nlm.nih.gov/pubmed/34928390
http://dx.doi.org/10.1093/nar/gkab1238
author Li, Jing
Singh, Urminder
Bhandary, Priyanka
Campbell, Jacqueline
Arendsee, Zebulun
Seetharam, Arun S
Wurtele, Eve Syrkin
collection PubMed
description Proteins encoded by newly-emerged genes (‘orphan genes’) share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene prediction pipelines to accurately predict genes in genomes according to phylostratal origin. BRAKER and MAKER are existing, popular ab initio tools that infer gene structures by machine learning. Direct Inference is an evidence-based pipeline we developed to predict gene structures from alignments of RNA-Seq data. The BIND pipeline integrates ab initio predictions of BRAKER and Direct Inference; MIND combines Direct Inference and MAKER predictions. We use highly-curated Arabidopsis and yeast annotations as gold-standard benchmarks, and cross-validate in rice. Each pipeline under-predicts orphan genes (as few as 11 percent, under one prediction scenario). Increasing RNA-Seq diversity greatly improves prediction efficacy. The combined methods (BIND and MIND) yield the best predictions overall, with BIND identifying 68% of annotated orphan genes and 99% of ancient genes, and give the highest sensitivity score regardless of dataset in Arabidopsis. We provide a lightweight, flexible, reproducible, and well-documented solution to improve gene prediction.
format Online
Article
Text
id pubmed-9023268
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9023268 2022-04-22 Foster thy young: enhanced prediction of orphan genes in assembled genomes Li, Jing Singh, Urminder Bhandary, Priyanka Campbell, Jacqueline Arendsee, Zebulun Seetharam, Arun S Wurtele, Eve Syrkin Nucleic Acids Res Methods Online Proteins encoded by newly-emerged genes (‘orphan genes’) share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene prediction pipelines to accurately predict genes in genomes according to phylostratal origin. BRAKER and MAKER are existing, popular ab initio tools that infer gene structures by machine learning. Direct Inference is an evidence-based pipeline we developed to predict gene structures from alignments of RNA-Seq data. The BIND pipeline integrates ab initio predictions of BRAKER and Direct Inference; MIND combines Direct Inference and MAKER predictions. We use highly-curated Arabidopsis and yeast annotations as gold-standard benchmarks, and cross-validate in rice. Each pipeline under-predicts orphan genes (as few as 11 percent, under one prediction scenario). Increasing RNA-Seq diversity greatly improves prediction efficacy. The combined methods (BIND and MIND) yield the best predictions overall, with BIND identifying 68% of annotated orphan genes and 99% of ancient genes, and give the highest sensitivity score regardless of dataset in Arabidopsis. We provide a lightweight, flexible, reproducible, and well-documented solution to improve gene prediction. Oxford University Press 2021-12-20 /pmc/articles/PMC9023268/ /pubmed/34928390 http://dx.doi.org/10.1093/nar/gkab1238 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Foster thy young: enhanced prediction of orphan genes in assembled genomes
topic Methods Online
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9023268/
https://www.ncbi.nlm.nih.gov/pubmed/34928390
http://dx.doi.org/10.1093/nar/gkab1238