Evaluating Plant Gene Models Using Machine Learning

Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident...

Bibliographic Details
Main authors: Upadhyaya, Shriprabha R., Bayer, Philipp E., Tay Fernandez, Cassandria G., Petereit, Jakob, Batley, Jacqueline, Bennamoun, Mohammed, Boussaid, Farid, Edwards, David
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9230120/
https://www.ncbi.nlm.nih.gov/pubmed/35736770
http://dx.doi.org/10.3390/plants11121619
author Upadhyaya, Shriprabha R.
Bayer, Philipp E.
Tay Fernandez, Cassandria G.
Petereit, Jakob
Batley, Jacqueline
Bennamoun, Mohammed
Boussaid, Farid
Edwards, David
collection PubMed
description Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boost (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91–0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes.
format Online
Article
Text
id pubmed-9230120
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9230120 2022-06-25 Evaluating Plant Gene Models Using Machine Learning Upadhyaya, Shriprabha R. Bayer, Philipp E. Tay Fernandez, Cassandria G. Petereit, Jakob Batley, Jacqueline Bennamoun, Mohammed Boussaid, Farid Edwards, David Plants (Basel) Communication Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boost (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91–0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes. MDPI 2022-06-20 /pmc/articles/PMC9230120/ /pubmed/35736770 http://dx.doi.org/10.3390/plants11121619 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Evaluating Plant Gene Models Using Machine Learning
topic Communication
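
The description in this record mentions sequence-based features and reports accuracy and F1 score, but it does not list the 14 gene and 41 protein features themselves. The following is a minimal sketch of what such calculations might look like, not the Truegene implementation: GC content and amino acid fractions are illustrative assumptions, the function names are hypothetical, and no XGBoost training or SHAP analysis is shown.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gc_content(nucleotides: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty).

    Illustrative gene-level feature; an assumption, not taken from the record.
    """
    seq = nucleotides.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def amino_acid_fractions(protein: str) -> dict[str, float]:
    """Fraction of each standard amino acid in a protein sequence.

    Illustrative protein-level features; an assumption, not taken from the record.
    """
    seq = protein.upper()
    counts = Counter(seq)
    total = len(seq)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Accuracy and F1 score for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

F1 ignores true negatives, so accuracy and F1 need not agree on the same set of predictions, which is one reason both are commonly reported for a classifier.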