GO for gene documents

BACKGROUND: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. The goal in this paper, based on this approach, is to develop automatic annotation methods that...

Bibliographic Details
Main authors: Srinivasan, Padmini; Qiu, Xin Ying
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2217661/
https://www.ncbi.nlm.nih.gov/pubmed/18047704
http://dx.doi.org/10.1186/1471-2105-8-S9-S3
_version_ 1782149295145746432
author Srinivasan, Padmini
Qiu, Xin Ying
author_facet Srinivasan, Padmini
Qiu, Xin Ying
author_sort Srinivasan, Padmini
collection PubMed
description BACKGROUND: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. The goal in this paper, based on this approach, is to develop automatic annotation methods that can supplement the expensive manual annotation processes currently in place. RESULTS: Using a set of Support Vector Machine (SVM) classifiers, we were able to achieve F-scores of 0.49, 0.41 and 0.33 for codes of the molecular function, cellular component and biological process GO hierarchies, respectively. We find that alternative term weighting strategies do not differ from each other in performance and that feature selection strategies reduce performance. The best thresholding strategy is one where a single threshold is picked for each hierarchy. Hierarchy level is important, especially for molecular function and biological process. The cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research shows that it is possible to beneficially exploit the hierarchical structures by defining and testing a relaxed criterion for classification correctness. Finally, it is possible to build classifiers for codes with very few associated documents, but, as expected, a huge penalty is paid in performance. CONCLUSION: The GO annotation problem is complex. Several key observations have been made, for example about topic drift, that may be important to consider in annotation strategies.
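The thresholding strategy mentioned in the abstract (a single threshold picked for each hierarchy) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `f_score` and `best_threshold`, the pooled (score, label) representation, and the sample values are all assumptions made for illustration, and the paper may compute or average its F-scores differently.

```python
from typing import List, Tuple

# Hypothetical representation: (classifier score, true label) pairs pooled
# over all GO codes of one hierarchy. Not taken from the paper.
Scored = List[Tuple[float, bool]]


def f_score(scored: Scored, threshold: float) -> float:
    """F-score (harmonic mean of precision and recall) when every pair
    with score >= threshold is predicted positive."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored: Scored) -> float:
    """Pick the one threshold for the whole hierarchy that maximizes the
    F-score; candidates are the observed scores, ties go to the lowest."""
    candidates = sorted({s for s, _ in scored})
    return max(candidates, key=lambda t: f_score(scored, t))


# Invented validation scores for one hierarchy.
molecular_function: Scored = [
    (0.9, True), (0.7, True), (0.4, False), (0.2, True), (-0.1, False),
]
threshold = best_threshold(molecular_function)
print(threshold, round(f_score(molecular_function, threshold), 3))
```

In this sketch the threshold would be chosen on held-out validation data, once per hierarchy (molecular function, cellular component, biological process), and then applied to the scores of every classifier in that hierarchy.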
format Text
id pubmed-2217661
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2217661 2008-01-31 GO for gene documents Srinivasan, Padmini Qiu, Xin Ying BMC Bioinformatics Proceedings BioMed Central 2007-11-27 /pmc/articles/PMC2217661/ /pubmed/18047704 http://dx.doi.org/10.1186/1471-2105-8-S9-S3 Text en Copyright © 2007 Srinivasan and Qiu; licensee BioMed Central Ltd.
spellingShingle Proceedings
Srinivasan, Padmini
Qiu, Xin Ying
GO for gene documents
title GO for gene documents
title_full GO for gene documents
title_fullStr GO for gene documents
title_full_unstemmed GO for gene documents
title_short GO for gene documents
title_sort go for gene documents
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2217661/
https://www.ncbi.nlm.nih.gov/pubmed/18047704
http://dx.doi.org/10.1186/1471-2105-8-S9-S3
work_keys_str_mv AT srinivasanpadmini goforgenedocuments
AT qiuxinying goforgenedocuments