Evaluation of high-throughput functional categorization of human disease genes

BACKGROUND: Biological data that are well organized by an ontology, such as Gene Ontology, enable high-throughput availability of the semantic web. They can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published...

Bibliographic Details
Main authors: Chen, James L, Liu, Yang, Sam, Lee T, Li, Jianrong, Lussier, Yves A
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892104/
https://www.ncbi.nlm.nih.gov/pubmed/17493290
http://dx.doi.org/10.1186/1471-2105-8-S3-S7
author Chen, James L
Liu, Yang
Sam, Lee T
Li, Jianrong
Lussier, Yves A
collection PubMed
description BACKGROUND: Biological data that are well organized by an ontology, such as Gene Ontology, enable high-throughput availability of the semantic web. They can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group and contains 923 human disease genes organized in 14 categories of protein function. RESULTS: Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods achieve overall precision of 56% and 47% with recall of 62% and 71%, respectively. However, approximately 15% of the studied human disease genes remain without GO annotations. CONCLUSION: Automated methods can recapitulate a significant portion of the classification of human disease genes. The method using information-theoretic distance performs slightly better on precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.
format Text
id pubmed-1892104
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1892104 2007-06-15 Evaluation of high-throughput functional categorization of human disease genes Chen, James L Liu, Yang Sam, Lee T Li, Jianrong Lussier, Yves A BMC Bioinformatics Research BioMed Central 2007-05-09 /pmc/articles/PMC1892104/ /pubmed/17493290 http://dx.doi.org/10.1186/1471-2105-8-S3-S7 Text en Copyright © 2007 Chen et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Evaluation of high-throughput functional categorization of human disease genes
topic Research