
Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free te...


Bibliographic Details
Main authors:	Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2013
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706742/ https://www.ncbi.nlm.nih.gov/pubmed/23842461 http://dx.doi.org/10.1093/database/bat041

author	Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick
collection	PubMed
description	The available curated data lag behind the current biological knowledge contained in the literature. Text mining can help biologists and curators locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification rely on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and the GO terms themselves. However, their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) to populate this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) on the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performance have evolved. Systems are evaluated on their ability to propose, for a given abstract, the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms to GO (+300%), the effectiveness of our thesaurus-based system has remained roughly constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, rising from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists. Database URL: http://eagl.unige.ch/GOCat/
format	Online Article Text
id	pubmed-3706742
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3706742 2013-07-10 Database (Oxford) Original Article Oxford University Press 2013-07-09 /pmc/articles/PMC3706742/ /pubmed/23842461 http://dx.doi.org/10.1093/database/bat041 Text en © The Author(s) 2013. Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3706742/ https://www.ncbi.nlm.nih.gov/pubmed/23842461 http://dx.doi.org/10.1093/database/bat041

Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases
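
The abstract in this record describes two ideas that a short example may make more concrete: a k-Nearest Neighbours system that infers GO terms for an unseen abstract from already curated abstracts, and the Recall at 20 (R20) measure used to evaluate it. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the bag-of-words cosine similarity, the similarity-weighted voting, the function names (`knn_go_profile`, `recall_at_k`), the toy knowledge base and the placeholder term labels are all hypothetical and do not come from the record.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumed, deliberately simple choice)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_go_profile(query, knowledge_base, k=2):
    """Rank GO terms for `query` using its k most similar curated abstracts.

    `knowledge_base` is a list of (abstract_text, [go_terms]) pairs.
    Each neighbour votes for its terms with a weight equal to its similarity.
    """
    q = Counter(tokenize(query))
    neighbours = sorted(
        ((cosine(q, Counter(tokenize(text))), terms) for text, terms in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, terms in neighbours:
        for term in terms:
            votes[term] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(votes, key=lambda t: (-votes[t], t))


def recall_at_k(ranked_terms, gold_terms, k=20):
    """Fraction of the curated (gold) terms found in the top-k proposals."""
    gold = set(gold_terms)
    if not gold:
        return 0.0
    return len(set(ranked_terms[:k]) & gold) / len(gold)


# Toy knowledge base with placeholder labels (not real GO identifiers).
TOY_KB = [
    ("caspase activation triggers apoptosis in neurons", ["GO:toy_apoptosis", "GO:toy_caspase"]),
    ("dna repair after double strand break", ["GO:toy_dna_repair"]),
    ("apoptosis induced by dna damage", ["GO:toy_apoptosis", "GO:toy_dna_damage"]),
]

ranked = knn_go_profile("apoptosis in neurons after dna damage", TOY_KB, k=2)
r20 = recall_at_k(ranked, ["GO:toy_apoptosis", "GO:toy_dna_repair"], k=20)
print(ranked, r20)
```

In this toy run one of the two gold terms is proposed, so R20 is 0.5. With the 2.8 curated terms per abstract reported in the record, an R20 of 0.56 would correspond to recovering a little over half of the curated terms among the top 20 proposals, on average.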
