The TREC 2004 genomics track categorization task: classifying full text biomedical documents

Bibliographic Details
Main Authors: Cohen, Aaron M, Hersh, William R
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1440303/
https://www.ncbi.nlm.nih.gov/pubmed/16722582
http://dx.doi.org/10.1186/1747-5333-1-4
_version_ 1782127316192722944
author Cohen, Aaron M
Hersh, William R
author_facet Cohen, Aaron M
Hersh, William R
author_sort Cohen, Aaron M
collection PubMed
description BACKGROUND: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories. RESULTS: The track had 33 participating groups. The mean utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Significant feature overlap between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Identifying papers containing GO term evidence will likely need to be treated as a separate task for each concept represented in GO, and will therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task. CONCLUSION: Automated classification of documents for GO annotation is a challenging task, as is the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis of which algorithmic features are most useful in biomedical document classification, and a better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will continue in 2005, focusing on a wider range of triage tasks and on improving results from 2004.
format Text
id pubmed-1440303
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1440303 2006-04-19 The TREC 2004 genomics track categorization task: classifying full text biomedical documents Cohen, Aaron M Hersh, William R J Biomed Discov Collab Research BACKGROUND: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories. RESULTS: The track had 33 participating groups. The mean utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Significant feature overlap between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Identifying papers containing GO term evidence will likely need to be treated as a separate task for each concept represented in GO, and will therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task. CONCLUSION: Automated classification of documents for GO annotation is a challenging task, as is the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis of which algorithmic features are most useful in biomedical document classification, and a better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will continue in 2005, focusing on a wider range of triage tasks and on improving results from 2004. BioMed Central 2006-03-14 /pmc/articles/PMC1440303/ /pubmed/16722582 http://dx.doi.org/10.1186/1747-5333-1-4 Text en Copyright © 2006 Cohen and Hersh; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Cohen, Aaron M
Hersh, William R
The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title_full The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title_fullStr The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title_full_unstemmed The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title_short The TREC 2004 genomics track categorization task: classifying full text biomedical documents
title_sort trec 2004 genomics track categorization task: classifying full text biomedical documents
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1440303/
https://www.ncbi.nlm.nih.gov/pubmed/16722582
http://dx.doi.org/10.1186/1747-5333-1-4
work_keys_str_mv AT cohenaaronm thetrec2004genomicstrackcategorizationtaskclassifyingfulltextbiomedicaldocuments
AT hershwilliamr thetrec2004genomicstrackcategorizationtaskclassifyingfulltextbiomedicaldocuments
AT cohenaaronm trec2004genomicstrackcategorizationtaskclassifyingfulltextbiomedicaldocuments
AT hershwilliamr trec2004genomicstrackcategorizationtaskclassifyingfulltextbiomedicaldocuments
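
The utility and F-measure figures quoted in the description field above are the track's two evaluation metrics. As a minimal sketch, assuming the triage subtask was scored with a normalized utility that weights each correctly triaged positive article by a relative-utility factor of 20 (the weight reported for the 2004 triage subtask) and the annotation subtasks were scored with the balanced F-measure, the Python below illustrates how such scores would be computed; the counts in the usage example are hypothetical.

```python
# Sketch of the two evaluation measures referenced in the abstract.
# Assumption: triage used the normalized utility
#   U_norm = (u_r * TP - FP) / (u_r * AP)
# with relative-utility weight u_r = 20, where AP is the number of
# articles that actually warrant triage; the weight is illustrative.

def normalized_utility(tp: int, fp: int, actual_positives: int, u_r: float = 20.0) -> float:
    """Raw utility (u_r * TP - FP) divided by the best achievable utility (u_r * AP)."""
    u_raw = u_r * tp - fp
    u_max = u_r * actual_positives
    return u_raw / u_max

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1), as used for the annotation subtasks."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical counts for a run over a test set with 420 positive articles.
    print(round(normalized_utility(tp=300, fp=2000, actual_positives=420), 4))  # 0.4762
    print(round(f_measure(tp=300, fp=2000, fn=120), 4))                         # 0.2207
```

Note how the utility weight makes recall far more important than precision for triage: a system can return many false positives and still score well, which is one reason a simple baseline such as the MeSH term Mice was hard to beat.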