Training Set Selection for the Prediction of Essential Genes

Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poor...

Bibliographic Details
Main authors:	Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3899339/ https://www.ncbi.nlm.nih.gov/pubmed/24466248 http://dx.doi.org/10.1371/journal.pone.0086805

_version_	1782300559218311168
author	Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng
author_sort	Cheng, Jian
collection	PubMed
description	Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.
format	Online Article Text
id	pubmed-3899339
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3899339 2014-01-24 Training Set Selection for the Prediction of Essential Genes PLoS One Research Article
Public Library of Science 2014-01-22 /pmc/articles/PMC3899339/ /pubmed/24466248 http://dx.doi.org/10.1371/journal.pone.0086805 Text en © 2014 Cheng et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Training Set Selection for the Prediction of Essential Genes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3899339/ https://www.ncbi.nlm.nih.gov/pubmed/24466248 http://dx.doi.org/10.1371/journal.pone.0086805
