In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

BACKGROUND: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery...

Bibliographic Details
Main authors: Lin, Frank PY, Coiera, Enrico, Lan, Ruiting, Sintchenko, Vitali
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2669486/
https://www.ncbi.nlm.nih.gov/pubmed/19292914
http://dx.doi.org/10.1186/1471-2105-10-86
author Lin, Frank PY
Coiera, Enrico
Lan, Ruiting
Sintchenko, Vitali
collection PubMed
description BACKGROUND: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. RESULTS: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. CONCLUSION: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
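The abstract describes statistical CGP only in outline: genes are ranked by how differently they occur across phenotypic groups of genomes, and rankings are judged by AUC. The following is a minimal sketch of that idea, not the authors' implementation. The scoring statistic (a plain difference in occurrence frequency), the function names, and the toy gene and genome labels are all assumptions made for illustration; the record does not state which statistic the paper actually uses.

```python
from typing import Dict, List, Set


def frequency_difference_scores(
    profiles: Dict[str, Set[str]],
    positive: Set[str],
    negative: Set[str],
) -> Dict[str, float]:
    """Score each gene by its occurrence frequency in the positive
    phenotype group minus its frequency in the negative group.
    ASSUMPTION: this simple difference stands in for whatever
    statistic the paper applies."""
    scores = {}
    for gene, genomes in profiles.items():
        freq_pos = len(genomes & positive) / len(positive)
        freq_neg = len(genomes & negative) / len(negative)
        scores[gene] = freq_pos - freq_neg
    return scores


def rank_genes(scores: Dict[str, float]) -> List[str]:
    """Return gene names from highest to lowest score (ties by name)."""
    return sorted(scores, key=lambda g: (-scores[g], g))


def auc(scores: Dict[str, float], known: Set[str]) -> float:
    """Area under the ROC curve for rediscovering the known genes,
    computed as the fraction of (known, other) pairs ranked correctly;
    ties count as half."""
    pos = [s for g, s in scores.items() if g in known]
    neg = [s for g, s in scores.items() if g not in known]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical phylogenetic profiles: gene -> genomes in which it occurs.
profiles = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1", "g4"},
    "geneC": {"g4", "g5"},
    "geneD": {"g1", "g2", "g3", "g4", "g5"},
}
positive = {"g1", "g2", "g3"}  # genomes showing the phenotype
negative = {"g4", "g5"}        # genomes lacking it

scores = frequency_difference_scores(profiles, positive, negative)
ranking = rank_genes(scores)
```

In this toy data, geneA occurs in every positive genome and in no negative genome, so it ranks first, while geneD occurs everywhere and scores zero. A rediscovery experiment in the sense of the abstract would hide the known pathway genes, rank all genes, and report how highly the hidden genes are placed, for example via `auc(scores, known_genes)`.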
format Text
id pubmed-2669486
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2669486 2009-04-16 In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles Lin, Frank PY Coiera, Enrico Lan, Ruiting Sintchenko, Vitali BMC Bioinformatics Research Article BioMed Central 2009-03-17 /pmc/articles/PMC2669486/ /pubmed/19292914 http://dx.doi.org/10.1186/1471-2105-10-86 Text en Copyright © 2009 Lin et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2669486/
https://www.ncbi.nlm.nih.gov/pubmed/19292914
http://dx.doi.org/10.1186/1471-2105-10-86